Abstract Here we show that, given a set of clusters 𝒞 on a set of taxa X,
where |X| = n, it is possible to determine in time f(k) · poly(n) whether there
exists a level-≤k network (i.e. a network where each biconnected component
has reticulation number at most k) that represents all the clusters in 𝒞 in the
softwired sense, and if so to construct such a network. This extends a
polynomial-time result from [12]. By generalizing the concept of "level-k generator"
to general networks, we then extend this fixed parameter tractability result to
the problem where k refers not to the level but to the reticulation number of
the whole network.

Keywords Phylogenetics • Fixed Parameter Tractability • Directed Acyclic 
Graphs 



1 Introduction 

1.1 Phylogenetic networks and softwired clusters 

The traditional model for representing the evolution of a set of species X (or,
more abstractly, a set of taxa) is the rooted phylogenetic tree [24, 8, 9].
Essentially, this is a singly-rooted tree where the leaves are bijectively labelled by X
and the edges are directed away from the root. In recent years there has been a
growing interest in extending this model to also incorporate non-treelike
evolutionary phenomena such as hybridizations, recombinations and horizontal
gene transfers. This has subsequently stimulated research into rooted phylogenetic
networks, which generalize rooted phylogenetic trees by also permitting
nodes with indegree two or higher, known as reticulation nodes, or simply
reticulations. For detailed background information on phylogenetic networks
we refer the reader to [14, 21, 23, 13, 28, 15]. Figure 1 shows an example of a
rooted phylogenetic network.







Fig. 1 Example of a phylogenetic network with five reticulations. The encircled subgraphs
form its biconnected components, also known as its "tangles". This binary network has level
equal to 2 since each biconnected component contains at most two reticulations.

We are interested in the following biologically-motivated optimization problem.
We are given a set 𝒞 of clusters on X, where a cluster is simply a strict
subset of X. We wish to construct a phylogenetic network that "represents" all
the clusters in 𝒞 such that the amount of reticulation in the network is
"minimized". There are several different definitions of "represents" and "minimized"
present in the literature. In this article we will consider only the softwired
definition of "represents" [IS l lS m iM l fTS]. Most of our formal definitions will be
deferred to the preliminaries. Nevertheless, it is helpful to already formally
state that a rooted phylogenetic tree T on X represents a cluster C ⊂ X if T
contains an edge (u, v) such that C is exactly equal to the subset of X reachable
from v by directed paths. A phylogenetic network N on X, on the other hand,
represents a cluster C ⊂ X in the softwired sense if there exists some rooted
phylogenetic tree T on X such that T represents C and T is topologically
embedded inside N. Regarding "minimized", we consider two closely related, but
subtly different, variants of minimality. The first variant, reticulation number
minimization, aims at minimizing the total number of reticulation nodes in the
network¹. The second, less well-known variant, level minimization [18, 17, 26,
29, 25], asks us to minimize the maximum number of reticulation nodes contained
in any "tangled" region of the network; these regions essentially correspond
to the non-trivial biconnected components of the underlying undirected graph
(see Figure 1). The reticulation number is a global optimality criterion, while
the level is a local optimality criterion. In general, minimizing for one variant
does not induce minimum solutions for the other variant (see e.g. Figure 3 of
[13]), although the algorithmic techniques used to tackle these problems are
often related [19].

¹ This is the definition when all reticulation vertices have indegree 2; for more general
networks the reticulation number is defined slightly differently. See the Preliminaries for more
information.

Both these problems are NP-hard and APX-hard [511^. This raises the natural
question: is it NP-hard to minimize the reticulation number (respectively,
the level) if the number of reticulation nodes in the network (respectively,
per tangled region) is fixed? Prior to this article only partial answers
to these questions were known. In [19] it was proven that level minimization
is polynomial-time solvable if the level is fixed. A striking aspect of this proof
is that the running time of the algorithm is polynomial only in a highly
theoretical sense: it is too high to be of any practical interest. This exorbitant
running time has two causes. The first is the exhaustive enumeration of all
generators [55], essentially the set of all possible underlying topologies of a network
if the taxa are ignored. The second is that, after determining the correct generator, a
second wave of exhaustive enumeration determines where a critical subset of
X should be located within the network, after which all remaining elements
of X can easily be added without much computational effort.

The question of whether a corresponding positive result would hold for
reticulation number minimization was left open, although the emergence of
several partial results and practically efficient algorithms [13, 19] suggested
that this might well be the case. Furthermore, it was not obvious how the
algorithm from [19] could be adapted to yield a fixed parameter tractable
algorithm for level minimization, where the parameter is the level k of the
network, since k appears as an exponent of |X| in the running time of
the algorithm. (We refer to [6, 22, 5, 10] for an introduction to fixed parameter
tractability.) Curiously, the main problem is not the enumeration of the
generators, because the number of generators is independent of |X| [7], but the
allocation of the critical initial subset of taxa to their correct location in the
network.

In this article we settle all these questions by proving for the first time
that both level minimization and reticulation number minimization are fixed
parameter tractable (where, in the case of reticulation number minimization,
the parameter is the reticulation number of the whole network). We give one
algorithm for level minimization and one algorithm for reticulation number
minimization, although the two algorithms have a large common core. The algorithms
again rely heavily on generators, which we extend here to also be useful in
the context of reticulation number minimization; generators had hitherto only
appeared in the level minimization literature. In both algorithms the major
non-triviality is showing how the network structure can still be adequately
recovered if the parameter is no longer allowed to appear in the exponent of
|X| as it was in [19].

1.2 Beyond softwired clusters: the wider context

We believe that this approach is significant beyond the softwired cluster literature.
Other articles discuss the problem of constructing rooted phylogenetic
networks not by combining clusters but by combining triplets [27, 29], characters
[T H [T2 | [33 l [20 ] or entire phylogenetic trees into a network. These models
are in general mutually distinct, although they do have a significant common
overlap which reaches its peak in the case of data derived from two phylogenetic
trees. To see this, note that if one takes the union of clusters represented
by a set of two or more phylogenetic trees, then the reticulation number (or
level) required to represent these clusters is in general less than or equal to
the reticulation number (or level) required to topologically embed the trees
themselves in the network, and this inequality is often strict. However, in the
case of a set comprising exactly two trees the inequality becomes equality [55].
Hence for data obtained from two trees one could solve the reticulation number
minimization and level minimization problems for clusters by using algorithms
developed for the problem of topologically embedding the trees themselves into
a network. These algorithms are highly efficient and fixed parameter tractable
in a practical, as opposed to solely theoretical, sense [Tll^HllHT]. However, these
tree algorithms do not help us with more general cluster sets, because for
more than two trees the optima of the cluster and tree models start to diverge.
Indeed, the cluster model often saves reticulations with respect to the
tree model by weakening the concept of "above" and "below" in the network,
which is exactly why the input tree topologies do not generally survive if one
atomizes them into their constituent clusters [28]. Moreover, the literature on
embedding three or more trees into a network is not yet mature, with articles
restricting themselves to preliminary explorations [32, 16]. It therefore seems
plausible that the generator approach might be adapted to the tree model (or
the other constructive methods mentioned) to yield a unified technique for
producing positive complexity results for reticulation number minimization
and level minimization, even in the case of many input trees (or data obtained
from many input trees).



2 Preliminaries 

Consider a set of taxa X, where |X| = n. A rooted phylogenetic network (on X),
henceforth network, is a directed acyclic graph with a single node with indegree
zero (the root), no nodes with both indegree and outdegree equal to 1,
and nodes with outdegree zero (the leaves) bijectively labeled by X. In this
article we usually identify the leaves with X. The indegree of a node v is
denoted δ⁻(v) and v is called a reticulation if δ⁻(v) ≥ 2, otherwise v is a
tree node. An edge (u, v) is called a reticulation edge if its target node v is a
reticulation and is called a tree edge otherwise. When counting reticulations
in a network, we count reticulations with more than two incoming edges more
than once because, biologically, these reticulations represent several reticulate
evolutionary events. Therefore, we formally define the reticulation number of
a network N = (V, E) as

$$ r(N) = \sum_{v \in V :\, \delta^{-}(v) > 0} \bigl(\delta^{-}(v) - 1\bigr) = |E| - |V| + 1 . $$
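As a quick sanity check of the two expressions for r(N), the following Python sketch computes both from an edge list. It is only an illustration: the function name `reticulation_number` and the edge-list encoding are assumptions made here, and the identity relies on the network having exactly one node of indegree zero (the root).

```python
from collections import Counter

def reticulation_number(edges):
    """Reticulation number of a rooted network given as directed (u, v) edges.

    Hypothetical helper for illustration; assumes a single root.
    """
    indegree = Counter(v for _, v in edges)
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    # Sum of (indegree - 1) over all nodes with at least one incoming edge.
    by_sum = sum(d - 1 for d in indegree.values())
    # Closed form |E| - |V| + 1.
    by_count = len(edges) - len(nodes) + 1
    assert by_sum == by_count, "the identity needs exactly one indegree-0 node"
    return by_sum

# A small network on taxa a, b, c with one reticulation h (parents u and v).
example_edges = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
                 ("v", "h"), ("v", "c"), ("h", "b")]
example_r = reticulation_number(example_edges)
```

Here `example_edges` has 7 edges and 7 nodes, so both expressions evaluate to 1.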

A rooted phylogenetic tree on X, henceforth tree, is simply a network that
has reticulation number zero. We say that a network N on X displays a tree T
if T can be obtained from N by performing a series of node and edge deletions
and finally suppressing nodes with both indegree and outdegree equal to
1; see Figure 2 for an example. We assume without loss of generality that each
reticulation has outdegree at least one. Consequently, each leaf has indegree
one. We say that a network is binary if every reticulation node has indegree 2
and outdegree 1 and every tree node that is not a leaf has outdegree 2.

Proper subsets of X are called clusters, and a cluster C is a singleton if
|C| = 1. We say that an edge (u, v) of a tree represents a cluster C ⊂ X if C is
the set of leaf descendants of v. A tree T represents a cluster C if it contains
an edge that represents C. It is well-known that the set of clusters represented
by a tree is a laminar family, often called a hierarchy in the phylogenetics
literature, and uniquely defines that tree. We say that N represents C "in the
softwired sense" if N displays some tree T on X such that T represents C, see
Figures 2 and 3. In this article we only consider the softwired notion of cluster
representation and henceforth assume this implicitly². A network represents
a set of clusters 𝒞 if it represents every cluster in 𝒞 (and possibly more).
The set of all softwired clusters represented by a network can be obtained as
follows. For a network N, we say that a switching of N is obtained by, for each
reticulation node, deleting all but one of its incoming edges. Given a network
N and a switching T_N of N, we say that an edge (u, v) of N represents a cluster
C w.r.t. T_N if (u, v) is an edge of T_N and C is the set of leaf descendants of v
in T_N. The set of all softwired clusters represented by N, denoted 𝒞(N), is the
set of clusters represented by all edges of N w.r.t. T_N, where T_N ranges over
all possible switchings of N. Note that the set of all possible switchings of N
coincides with the set of all trees displayed by N. It is also natural to define
that an edge (u, v) of N represents a cluster C if there exists some switching
T_N of N such that (u, v) represents C w.r.t. T_N. Note that, in general, an edge
of N might represent multiple clusters, and a cluster might be represented by
multiple edges of N.
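The switching-based definition can be turned directly into a brute-force procedure. The Python sketch below is an illustration only, not one of the algorithms of this article: the name `softwired_clusters` and the edge-list encoding are assumptions made here, and the enumeration is exponential in the number of reticulations. It keeps only non-empty proper subsets of X, in line with the definition of a cluster.

```python
from itertools import product

def softwired_clusters(edges, taxa):
    """All softwired clusters of a network given as directed (u, v) edges.

    Hypothetical helper: leaves are identified with their taxon labels.
    """
    taxa = frozenset(taxa)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) >= 2]
    tree_edges = [(u, v) for u, v in edges if len(parents[v]) == 1]
    clusters = set()
    # A switching keeps exactly one incoming edge per reticulation.
    for choice in product(*(parents[r] for r in reticulations)):
        switching = tree_edges + list(zip(choice, reticulations))
        children = {}
        for u, v in switching:
            children.setdefault(u, []).append(v)

        def leaves_below(v):
            if v in taxa:
                return frozenset([v])
            found = frozenset()
            for w in children.get(v, []):
                found |= leaves_below(w)
            return found

        for _, v in switching:
            cluster = leaves_below(v)
            if cluster and cluster != taxa:
                clusters.add(cluster)
    return clusters

# One reticulation h with parents u and v, on taxa a, b, c.
example_edges = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
                 ("v", "h"), ("v", "c"), ("h", "b")]
example_clusters = softwired_clusters(example_edges, "abc")
```

In this example the two switchings contribute {a, b} and {b, c} respectively, in addition to the three singletons. A set 𝒞 is then represented by the network exactly when it is a subset of the returned set.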

² Alternatively, we say that a network N represents a cluster C ⊂ X "in the hardwired
sense" if there exists a tree edge (u, v) of N such that C is the set of leaf descendants of v.

Fig. 2 A phylogenetic tree T (a) and a phylogenetic network N (b,c); (b) illustrates in grey
that N displays T (deleted edges are dashed); (c) illustrates that N represents (amongst
others) the cluster {c, d, e} in the softwired sense (dashed reticulation edges are "switched
off").

Given a set of clusters 𝒞 on X, throughout the article we assume that,
for any taxon x ∈ X, 𝒞 contains at least one cluster C containing x. For
a set 𝒞 of clusters on X we define r(𝒞) as min{r(N) | N represents 𝒞}; we
sometimes refer to this as the reticulation number of 𝒞. The related concept
of level requires some more background. A directed acyclic graph is connected
(also called "weakly connected") if there is an undirected path (ignoring edge
orientations) between each pair of nodes. A node (edge) of a directed graph is
called a cut-node (cut-edge) if its removal disconnects the graph. A directed
graph is biconnected if it contains no cut-nodes. A biconnected subgraph B
of a directed graph G is said to be a biconnected component if there is no
biconnected subgraph B' ≠ B of G that contains B. A phylogenetic network is
said to be a level-≤k network if each biconnected component has reticulation
number less than or equal to k³. A network is called simple if the removal of a
cut-node or a cut-edge creates two or more connected components of which at
most one is non-trivial (i.e. contains at least one edge). A (simple) level-≤k
network N is called a (simple) level-k network if the maximum reticulation
number among the biconnected components of N is precisely k. For example,
the network in Figure 1 is a level-2 network (which is not simple), the network
in Figure 3(a) is a simple level-4 network and the network in Figure 3(b) is
a simple level-2 network. Note that a tree is a level-0 network. For a set 𝒞
of clusters on X we define ℓ(𝒞), the level of 𝒞, as the smallest k ≥ 0 such
that there exists a level-k network that represents 𝒞. It is immediate that for
every cluster set 𝒞 it holds that r(𝒞) ≥ ℓ(𝒞), because a level-k network always
contains at least one biconnected component containing k reticulations.

We say that two clusters C1, C2 ⊂ X are compatible if either C1 ∩ C2 = ∅
or C1 ⊆ C2 or C2 ⊆ C1, and incompatible otherwise. Consider a set of clusters 𝒞.
The incompatibility graph IG(𝒞) of 𝒞 is the undirected graph (V, E) that
has node set V = 𝒞 and edge set E = {{C1, C2} | C1 and C2 are incompatible
clusters in 𝒞}. We say that a set of taxa X' ⊆ X is compatible with 𝒞 if every
cluster C ∈ 𝒞 is compatible with X', and incompatible otherwise.

We say that a set of clusters 𝒞 on X is separating if it is incompatible with
all sets of taxa X' such that X' ⊂ X and |X'| ≥ 2.
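These three definitions translate almost literally into code. The following Python sketch is illustrative only; the function names are chosen here, and `is_separating` is a deliberately naive check that enumerates every proper subset X' with |X'| ≥ 2, so it is exponential in n and suitable only for toy inputs.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two sets are compatible if they are disjoint or nested."""
    return not (c1 & c2) or c1 <= c2 or c2 <= c1

def incompatibility_graph(clusters):
    """Edge set of IG(C): unordered pairs of incompatible clusters."""
    sets = [frozenset(c) for c in clusters]
    return {frozenset((a, b)) for a, b in combinations(sets, 2)
            if not compatible(a, b)}

def is_separating(clusters, taxa):
    """Naive check: no proper subset X' with |X'| >= 2 is compatible with C."""
    sets = [frozenset(c) for c in clusters]
    taxa = sorted(taxa)
    for size in range(2, len(taxa)):          # proper subsets only
        for subset in combinations(taxa, size):
            candidate = frozenset(subset)
            if all(compatible(candidate, c) for c in sets):
                return False
    return True

example_graph = incompatibility_graph(
    [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"a", "b", "c"}])
```

In `example_graph` the only incompatible pair is {a, b} and {b, c}: they overlap in b but neither contains the other.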

When we write f(k) we mean "some function that only depends on k". For
simplicity we overload f(k) to refer to multiple different functions with this
property. We write poly(n) to mean "some function f(n) that is polynomial in
n", where |X| = n. As in the case of f(k), we often overload this expression.



³ Note that to determine the reticulation number of a biconnected component, the
indegree of each node is computed using only edges belonging to this biconnected component.



Constructing minimal softwired networks is fixed parameter tractable 






Fig. 3 Two examples of networks that represent, among others, the set of clusters
𝒞 = {{a,b,f,g,i}, {a,b,c,f,g,i}, {a,b,f,i}, {b,c,f,i}, {c,d,e,h}, {d,e,h}, {b,c,f,h,i},
{b,c,d,f,h,i}, {b,c,i}, {a,g}, {b,i}, {c,i}, {d,h}}. The network in (a) is a simple level-4
network, and the network in (b) is a (binary) simple level-2 network.

Indeed, the goal of this article is not to derive exact expressions for the running
time, but to show that it is bounded above by f(k) · poly(n). It should be noted
that the functions f(k) that we encounter in this article can be extremely large
exponential functions of k. Also, |𝒞| can in general be exponentially large as a
function of n, but (as we shall see in due course) it is reasonable to assume that
|𝒞| is bounded above by f(k) · poly(n) when the parameter k (reticulation number
or level) is fixed.

The next lemma ensures that, if our goal is to find a network representing 
a set of clusters and minimizing the level or the reticulation number, we can 
restrict our attention to binary networks: 

Lemma 1 [19] Let N be a network on X. Then we can transform N into a
binary network N' such that N' has the same reticulation number and level as
N and all clusters represented by N are also represented by N'.

Thanks to Lemma 1 we may assume that there exists a binary network
N with reticulation number r(𝒞) (or with level ℓ(𝒞) if we are interested in
level minimization) that represents 𝒞. We henceforth restrict our analysis to
binary networks and, except in places where it might cause confusion to not
be explicit, we will not emphasize again that we only deal with this kind of
network.



3 Minimizing level is fixed parameter tractable 

The aim of this section is to show that level minimization is fixed parameter
tractable. To compute ℓ(𝒞), we will repeatedly query, "Is ℓ(𝒞) = k? If so,
construct a network with level equal to k that represents 𝒞" for k = 0, ..., ℓ(𝒞),
where k starts at 0 and is incremented by 1 until the query is answered
positively. Assuming that the queries are correctly answered, this process will
terminate after ℓ(𝒞) + 1 iterations. Hence, to prove an overall running time of
f(ℓ(𝒞)) · poly(n), it is sufficient to show that for each k we can correctly answer
the query in time at most f(k) · poly(n). Note that r(𝒞) = ℓ(𝒞) = 0 if and only
if all the clusters in 𝒞 are pairwise compatible, which can be easily
checked in time poly(n), so we henceforth assume that k ≥ 1.
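The outer loop just described can be sketched as follows. This is a minimal illustration under stated assumptions: `try_level` is a hypothetical oracle standing in for the per-k query (it should return a network of level k representing the clusters, or `None`), and the construction of the tree in the case k = 0 is omitted.

```python
from itertools import combinations, count

def pairwise_compatible(clusters):
    """True if every two clusters are disjoint or nested (the case k = 0)."""
    sets = [frozenset(c) for c in clusters]
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(sets, 2))

def compute_level(clusters, try_level):
    """Return (level, network), querying k = 1, 2, ... until the first success.

    try_level(clusters, k) is a hypothetical oracle; the loop terminates
    because some network representing the clusters always exists.
    """
    if pairwise_compatible(clusters):
        return 0, None  # building the tree itself is not shown in this sketch
    for k in count(1):
        network = try_level(clusters, k)
        if network is not None:
            return k, network
```

If each oracle call takes time at most f(k) · poly(n), the loop makes at most ℓ(𝒞) such calls after the initial compatibility check.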

The high-level idea is the following. In [HUlfH] it is shown that level-k
networks can be constructed using a divide and conquer strategy. Informally,
the idea is to construct a level-≤k network for each connected component
of IG(𝒞) and then to combine these into a single network. The clusters in
each connected component first have to be processed, which creates (for each
component) a separating set of clusters. From Lemma 1 of [12], we know
that, if a level-k network representing a separating set of clusters 𝒞 on X
exists, a simple level-k network representing 𝒞 has to exist. This network will
never have two or more taxa with the same parent [19]. The transformation
underpinning Lemma 1 furthermore allows us to assume that this simple level-k
network is binary. Hence, the divide and conquer strategy essentially reduces
to constructing binary simple level-≤k networks for separating sets of clusters
(and then combining them into a single network).

In Section 3.1 we show how to construct a simple level-k network in time
f(k) · poly(n) from a separating set of clusters. Subsequently we show in Section
3.2 how to combine these networks in time f(k) · poly(n) into a single level-k
network.



3.1 Constructing simple networks from separating cluster sets 

Before proving the main result of this paper, we need to prove some preliminary 
results. 

Proposition 1 Given a simple level-k network N and a set of clusters 𝒞 on
X, checking whether 𝒞 is represented by N can be done in time f(k) · poly(n),
where n = |X|.

Proof. Note that there are at most 2^k trees displayed by N and each tree
represents at most 2(n − 1) clusters. This means that |𝒞(N)| is at most
2^{k+1}(n − 1). Since N cannot represent 𝒞 if |𝒞| > |𝒞(N)|, checking whether 𝒞 ⊆ 𝒞(N)
takes at most f(k) · poly(n) time. □

Thus, if |𝒞| > 2^{k+1}(n − 1), since 𝒞 is assumed to be separating, it is not
possible that ℓ(𝒞) = k and we can immediately answer "no" to the query. We
thus henceforth assume that |𝒞| ≤ 2^{k+1}(n − 1), i.e. that 𝒞 contains at most
f(k) · poly(n) clusters.
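The counting bound gives a cheap early rejection test. The sketch below simply evaluates 2^{k+1}(n − 1); the function name is an assumption made for illustration.

```python
def exceeds_cluster_bound(num_clusters, k, n):
    """True if |C| > 2^(k+1) * (n - 1), so the query for level k is answered "no"."""
    return num_clusters > 2 ** (k + 1) * (n - 1)

# The cluster set of Fig. 3 has 13 clusters on the 9 taxa a, ..., i.
fig3_rejected_at_level_2 = exceeds_cluster_bound(13, 2, 9)
```

For k = 2 and n = 9 the bound is 2^3 · 8 = 64, so a set of 13 clusters is not rejected by this test.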

If all the leaves of a binary simple level-k network N are removed and all
nodes with both indegree and outdegree equal to 1 are deleted, the resulting
structure is called a level-k generator as defined in [26]. See Figure 4 for the
level-1 and level-2 generators. The number of level-k generators is bounded by
f(k)⁴.

⁴ Note that the number of level-k generators grows rapidly in k; it is at least 2^{k−1}.



Fig. 4 The single level-1 generator and the four level-2 generators. Here the sides have been 
labelled with capital letters. 

The sides of a level-k generator are defined as the union of its edges (the
edge sides) and its nodes of indegree 2 and outdegree 0 (the node sides). The
number of sides in a generator is bounded by f(k), because the sum of its
vertices and edges is linear in k.

Definition 1 The set 𝒩^k (for k ≥ 1) is defined as the set of all networks that
can be constructed by choosing some level-k generator G and then applying
the following leaf hanging transformation to G such that each taxon of X
appears exactly once in the resulting network. (This is essentially identical
to the definition given in [29], which is only a superficial refinement of the
definition given in [26].)

1. First, for each pair u, v of vertices in G connected by a single edge (u, v),
replace (u, v) by a path with l ≥ 0 internal vertices and, for each such
internal vertex w, add a new leaf w', an edge (w, w'), and label w' with
some taxon from X. All the taxa added in this way are "on side s" where
s is the side corresponding to the edge (u, v). (It is also permitted that the
path has zero internal nodes, i.e. that the side remains empty.)

2. Second, for each pair u, v of vertices in G connected by two edges, treat
the two edges as in step 1, but ensure that at least one of the two paths
does not have zero internal nodes.

3. Third, for each vertex v of G with indegree 2 and outdegree 0 add a new
leaf y, an edge (v, y) and label y with a taxon x ∈ X; we say "taxon x is
on side s" where s is the side corresponding to vertex v.
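The three steps of the leaf hanging transformation can be sketched in Python as follows. Everything about the encoding is an assumption made for illustration: edge sides are a dict from side name to (u, v), node sides a dict from side name to vertex, and `placement` lists the taxa of each side from top to bottom. The example uses a level-1-style generator that we assume consists of two parallel edges from a root r to a reticulation h; the names r, h, A, B and C are illustrative.

```python
def hang_leaves(edge_sides, node_sides, placement):
    """Apply the leaf hanging transformation and return the network's edges."""
    # Step 2: two parallel edges may not both stay empty (no multi-edges survive).
    by_endpoints = {}
    for side, endpoints in edge_sides.items():
        by_endpoints.setdefault(endpoints, []).append(side)
    for sides in by_endpoints.values():
        if len(sides) == 2 and not any(placement.get(s) for s in sides):
            raise ValueError("parallel edges need at least one taxon between them")
    edges = []
    # Steps 1 and 2: subdivide each edge side and hang one leaf per new vertex.
    for side, (u, v) in edge_sides.items():
        previous = u
        for index, taxon in enumerate(placement.get(side, [])):
            w = (side, index)              # fresh internal vertex on this side
            edges.append((previous, w))
            edges.append((w, taxon))       # the hanging leaf, labelled by the taxon
            previous = w
        edges.append((previous, v))
    # Step 3: every node side receives exactly one leaf.
    for side, v in node_sides.items():
        (taxon,) = placement[side]
        edges.append((v, taxon))
    return edges

level1_edge_sides = {"A": ("r", "h"), "B": ("r", "h")}
level1_node_sides = {"C": "h"}
example_network = hang_leaves(level1_edge_sides, level1_node_sides,
                              {"A": ["a"], "B": ["c"], "C": ["b"]})
```

The tuple unpacking in step 3 raises an error if a node side is given zero or several taxa, which mirrors the fact that node sides always carry exactly one taxon.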

The main reason for step 2 in Definition 1 is to ensure that multi-edges
in generators do not survive in the final network, because our definition of
phylogenetic network does not allow multi-edges. The following lemma follows
directly from the results in Section 3.1 of [25]:

Lemma 2 The set 𝒩^k (for k ≥ 1) is equal to the set of all binary simple
level-k networks.

For example, the simple network in Figure 3(b) has been obtained from
generator 2a (see Figure 4) by putting 0 taxa on sides A and D, 1 taxon on
side F, 2 taxa on side B and 3 taxa on sides C and E.

By Lemma 1 of [19] and Lemmas 1 and 2, we have the following:

Corollary 1 Let 𝒞 be a separating set of clusters on X, such that ℓ(𝒞) ≥ 1.
Then there exists a network N in 𝒩^{ℓ(𝒞)} such that N represents 𝒞.

Given a taxon set X, we call any network resulting from adding all taxa in
X to sides of a generator G (in the sense of Definition 1) a completion of G on
X. Here we call a side that receives ≥ 2 taxa a long side, a side that receives 1
taxon a short side and a side that receives 0 taxa an empty side. Figure 3(b)
is thus a completion of generator 2a, where sides A and D are empty, side F
is short, and sides B, C, E are long. Note that node sides (such as F in the
example) are always short, but not all short sides are node sides, i.e. edge sides
can be short too.

Given a generator G, a set of side guesses for G, denoted by S_G, is a
set of guesses about the type of each side of G (i.e. whether it is empty, short,
or long). A completion N of G on X respects S_G if all sides that are long in S_G
receive at least 2 taxa in N, sides that are short in S_G one taxon and empty
sides zero taxa. Then we have the following result:

Observation 1. Searching in the space of all binary simple level-k networks
on X is equivalent to searching in the space of all completions of a level-k
generator G respecting a set of side guesses S_G, iterating over all sets of
side guesses for a generator and all level-k generators.
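For a single generator, the sets of side guesses can be enumerated as in the following Python sketch. The encoding (lists of side names, string labels for the three types) is an assumption made here. The feasibility filter is also an addition for illustration: it only discards guesses that cannot be respected by any completion on n taxa, using the facts that long sides receive at least 2 taxa, short sides exactly 1 and empty sides 0.

```python
from itertools import product

def side_guesses(edge_sides, node_sides, n):
    """Yield every set of side guesses that some completion on n taxa could respect."""
    for types in product(("empty", "short", "long"), repeat=len(edge_sides)):
        guess = dict(zip(edge_sides, types))
        guess.update({side: "short" for side in node_sides})  # node sides are always short
        short = sum(kind == "short" for kind in guess.values())
        long_ = sum(kind == "long" for kind in guess.values())
        if long_:
            feasible = short + 2 * long_ <= n   # long sides can absorb surplus taxa
        else:
            feasible = short == n               # without long sides the count is exact
        if feasible:
            yield guess

example_guesses = list(side_guesses(["A", "B"], ["C"], 3))
```

With two edge sides, one node side and n = 3 there are 3^2 = 9 candidate guesses, of which three pass the filter: both edge sides short, or one empty and the other long.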

Let G be a level-k generator and let S_G be a set of side guesses for G. We
say that the pair (G, S_G) is side-minimal w.r.t. a separating cluster set 𝒞 on
X and k, if there exists a completion N of G on X respecting S_G that is a
level-k network representing 𝒞 and, amongst all simple level-k networks that
represent 𝒞, N has a minimum number of long sides, and (to further break
ties) amongst those networks it has a minimum number of short sides.

We define an incomplete network as a generator G, a set of side guesses
S_G, a set of finished sides (i.e. those sides for which we have already decided
that no more taxa will be placed on them), a set of future sides (i.e. those short
and long sides that have had no taxa allocated yet) and at most one long side
on which at least one taxon has already been placed but where we might still
want to add some more taxa. We call this the active side. A valid completion of
an incomplete network is an assignment of the unallocated taxa to the future
sides and (possibly) above the taxa already placed on the active side, that
respects S_G and such that the resulting network (which we call the result of
the valid completion) represents 𝒞. Informally, the result of a valid completion
is any network on X respecting S_G and representing 𝒞 that is obtained by
respecting all placements of taxa made thus far.

For example, consider again the network in Figure 3(b). Let N be the network
in that figure and let N' be the network obtained from N by deleting
taxa c, d, e and suppressing the resulting vertices with indegree and outdegree
both equal to 1. Let G be generator 2a, and let S_G be the set of side guesses
where sides A and D are empty, side F is short, and sides B, C, E are long.
Sides A, B, D, E are finished, F is a future side and C is the active side. In
particular, we can perform a valid completion of this incomplete network N'
by putting taxa d and e above taxon h on side C and then putting c on side
F. In this case, N is the result of the completion, although in general an
incomplete network might have many valid completions, or none.

Given a cluster set 𝒞, we write x →_𝒞 y if and only if every non-singleton
cluster in 𝒞 containing x also contains y. Then we have the following result.
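The relation →_𝒞 is a one-line test on the cluster set. The Python sketch below (the function name `arrow` is chosen here) evaluates it on the cluster set listed in the caption of Fig. 3. Note that the relation holds vacuously when x lies in no non-singleton cluster.

```python
def arrow(x, y, clusters):
    """True if every non-singleton cluster containing x also contains y."""
    return all(y in c for c in clusters if x in c and len(c) > 1)

# The cluster set from the caption of Fig. 3, one string per cluster.
fig3_clusters = [set(s) for s in (
    "abfgi", "abcfgi", "abfi", "bcfi", "cdeh", "deh", "bcfhi",
    "bcdfhi", "bci", "ag", "bi", "ci", "dh")]

d_to_h = arrow("d", "h", fig3_clusters)
h_to_d = arrow("h", "d", fig3_clusters)
```

Here d →_𝒞 h holds, since each of {c,d,e,h}, {d,e,h}, {b,c,d,f,h,i} and {d,h} contains h, whereas h →_𝒞 d fails because of the cluster {b,c,f,h,i}.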

Proposition 2 Let 𝒞 be a separating set of clusters on X and let (x_1, ..., x_k)
be an ordered set of distinct taxa of X such that k ≥ 2 and x_i →_𝒞 x_{i+1} for
1 ≤ i ≤ (k − 1). Then x_k ↛_𝒞 x_1.

Proof. If x_i →_𝒞 x_{i+1} for 1 ≤ i ≤ (k − 1) and x_k →_𝒞 x_1, this means that the set
X' = ∪_{i=1}^{k} {x_i} is compatible with 𝒞: any non-singleton cluster that contains
some x_i then contains all of X', and any singleton cluster is either contained in
X' or disjoint from it. Since |X'| ≥ 2, we have a contradiction. □

The following observations will be useful to prove Lemma 3.

Observation 2. Let N be a phylogenetic network on X representing a set of
clusters 𝒞 on X constructed by choosing some level-k generator G and then
applying the leaf hanging transformation described in Definition 1 to G. If two
taxa x and y in X are on the same side of the generator underlying N and the
parent of x is a descendant of the parent of y, then y →_𝒞 x.

Given a simple phylogenetic network N, we say that a side s' is reachable
from a side s in N if there is a directed path in the generator underlying N
from the head of side s to the tail of side s'.

Observation 3. Let N be a phylogenetic network on X representing a set of
clusters 𝒞 on X constructed by choosing some level-k generator G and then
applying the leaf hanging transformation described in Definition 1 to G. Moreover,
let x and y be two taxa of X on the same side s of the generator underlying
N such that y →_𝒞 x and let z be a taxon on a side s' ≠ s such that s' is
not reachable from s and z →_𝒞 x. Then we have that z →_𝒞 y.

Proof. Since z →_𝒞 x, we know that every non-singleton cluster that contains
z also contains x. Now, let C be such a cluster. C is represented by some tree
T displayed by N, so some edge e in T is such that C is the set of all taxa
reachable by directed paths from the head of e. Now, z and x are both in
C, so there is a directed path from the head of e to z and a directed path from
the head of e to x. Since s' is not reachable from s, the only way that such a
directed path can reach x is via the parent of y, hence z →_𝒞 y.
If no cluster C containing z and x exists then, since z →_𝒞 x, the
only cluster containing z is the singleton cluster {z}. Then, obviously, z →_𝒞 y
too. □

Observation 4. Let N be a phylogenetic network on X representing a set of
clusters 𝒞 on X constructed by choosing some level-k generator G and then
applying the leaf hanging transformation described in Definition 1 to G. Let
x and y be two taxa in X on the same side s of the generator underlying N
such that there exists an edge e from the parent of y to the parent of x. Then
e represents all clusters in 𝒞 containing x but not y.

Fig. 5 Three examples of the N(l, s) operation. (a) N(l, s) when s is an unfinished short
node side; (b) N(l, s) when s is an unfinished short edge side (or a long side that does not
yet have any taxa); (c) N(l, s) when s is a long side that already has at least one taxon.

Observation 5. Let C be a separating cluster set on X . Then every size-2 
subset of X is incompatible with C. 

Let N be a simple phylogenetic network, l a taxon and s a side of
the generator G underlying N. We denote by N(l, s) the following operation,
where we exclude from consideration the case where s is a short side that
already has a taxon on it. If s is a short side, then N(l, s) is simply the network
obtained by putting l on side s (in the sense of Definition 1). Otherwise, s is a
long side, and then N(l, s) is the network obtained by placing l "just above"
the highest taxon on side s. If there are not yet any taxa on side s then we
simply let l be the first taxon on side s. (See Figure 5 for clarification.)
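The N(l, s) operation is easy to express if a network is stored as a generator plus, for each side, the list of taxa placed on it. The sketch below assumes exactly that encoding (a dict from side name to a list ordered from top to bottom, and a dict of side guesses using the labels "empty", "short" and "long"); the function name `place` and the choice to return a new dict rather than mutate the input are assumptions made for illustration.

```python
def place(placement, guesses, taxon, side):
    """Return the placement obtained by the N(l, s) operation."""
    current = list(placement.get(side, []))
    kind = guesses[side]
    if kind == "empty":
        raise ValueError("empty sides receive no taxa")
    if kind == "short" and current:
        raise ValueError("excluded case: the short side already has a taxon")
    updated = dict(placement)
    # Index 0 is the top of the side, so the new taxon sits just above the rest.
    updated[side] = [taxon] + current
    return updated

example_guesses = {"C": "long", "F": "short"}
step1 = place({"C": ["h"]}, example_guesses, "d", "C")
step2 = place(step1, example_guesses, "c", "F")
```

Returning a fresh dict keeps earlier incomplete networks intact, which is convenient when several candidate placements have to be explored from the same starting point.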

We are now ready to analyse Algorithm 1, which is a critical subroutine. Let
us assume that we have an incomplete network N with an active side s (which
is by definition long) such that all long sides s' ≠ s that are reachable from s
are finished. These preconditions will be motivated in due course. Informally,
Algorithm 1 lets us decide whether we should continue adding taxa to the top
of the active side, or stop and declare it finished. (In fact, the algorithm is
rather more complicated than that, because a side-effect of the algorithm is
that it sometimes adds taxa to unfinished short sides, irrespective of whether
it has chosen to add a taxon to the top of the active side.) Algorithm 2 will
repeatedly call Algorithm 1 until it finally declares the active side finished.
This is all ultimately driven by the main algorithm, Algorithm 3, which
(amongst other tasks) then identifies and initialises (i.e. places a first taxon at
the bottom of) a new active side.

Lemma 3 Let 𝒞 be a separating set of clusters on X and let k be the first
integer for which a level-k network representing 𝒞 exists. Let N be an incomplete
network such that its underlying generator G and set of side guesses S_G are
such that (G, S_G) is side-minimal w.r.t. 𝒞 and k, and let s be an active side
of N. Then, if a valid completion for N exists, Algorithm 1 computes a set of
(incomplete) networks such that this set contains at least one network for
which a valid completion exists.
Proof. Recall that, from Corollary 1, we can restrict our search to the class of networks given there. We write $\mathcal{X}(N)$ to denote the set of taxa present in an (incomplete) network $N$. For a set of clusters $\mathcal{C}$ on $\mathcal{X}$ and a subset $\mathcal{X}' \subseteq \mathcal{X}$, we define the restriction of $\mathcal{C}$ to $\mathcal{X}'$ as $\{C \cap \mathcal{X}' \mid C \in \mathcal{C}\}$. We start the proof by analyzing the case when $U = \emptyset$ (see Algorithm 1 for the definition of $\mathcal{X}'$, $U$, $B(l)$, etc.).

Case $U = \emptyset$. Suppose $|L'| \neq 1$. If $|L'| = 0$ then there are two possibilities. If $L = \emptyset$ then clearly no taxon $l$ can be placed directly above $x_i$ on $s$, because that would mean $l \rightarrow_{\mathcal{C}} x_i$, and thus $l \in L$, contradiction. Hence the only correct move is to declare that the side $s$ is finished and return $N$. If $L \neq \emptyset$ then, since $|L'| = 0$, we have that, for every $l \in L$ there exists some $l' \in L$ such that $l \neq l'$ and $l \rightarrow_{\mathcal{C}} l'$. Clearly the $\rightarrow_{\mathcal{C}}$ relation is not allowed to create cycles in $L$, because otherwise the set of taxa in the cycle would form a cluster compatible with $\mathcal{C}$ (see Proposition 2). Suppose we start at an arbitrary taxon in $L$ and perform a non-repeating walk on the taxa of $L$ by following the $\rightarrow_{\mathcal{C}}$ relation. Given that $L$ is of finite size and this walk cannot visit a taxon of $L$ that it has already visited earlier in the walk (thus creating a cycle), we will find a taxon $l \in L$ such that there is no $l' \in L$ such that $l \neq l'$ and $l \rightarrow_{\mathcal{C}} l'$, meaning that $l \in L'$, contradiction. So the case that $L \neq \emptyset$ but $L' = \emptyset$ cannot actually happen.

Now, consider the case that $|L'| \geq 2$. Algorithm 1 will always end the side $s$ and return $N$ in this case. Indeed, no valid completion of $N$ can have some not-yet-allocated taxon $p$ above $x_i$ on side $s$. Suppose this is not true. Clearly, from Observation 2, $p \rightarrow_{\mathcal{C}} x_i$, so $p \in L$. In this case, all taxa in $L'$ are either equal to $p$, or underneath $p$ and above $x_i$. Indeed, let $l \neq p$ be a taxon in $L'$ and suppose, for the sake of contradiction, that $l$ is above $p$ on side $s$ or on another side $s'$. If $l$ is above $p$ on side $s$, then from Observation 2 we have that $l \rightarrow_{\mathcal{C}} p$. If $l$ is on another side $s'$, the fact that $|U| = 0$ implies that there is no room under side $s$ so, by Observation 3, we have that $l \rightarrow_{\mathcal{C}} p$. Thus, in both cases (i.e. if $l$ is above $p$ on side $s$ or on a different side $s'$) we have that $l \rightarrow_{\mathcal{C}} p$, meaning that $l \notin L'$, contradiction. We can hence conclude that each taxon in $L'$ is either equal to $p$, or underneath $p$ and above $x_i$, in any completion of $N$ where $p$ is on $s$. But, however one arranges two or more taxa on one side, at least one taxon will imply another taxon in the sense of the $\rightarrow_{\mathcal{C}}$ relation. More formally, in any case there exist two taxa $l$ and $l'$ in $L'$ such that $l \neq l'$ and $l \rightarrow_{\mathcal{C}} l'$. This implies that $l \notin L'$, contradiction. This concludes the correctness of the case $|L'| \neq 1$.

Algorithm 1: addOnSide($N$, $s$)

 1  $\mathcal{X}' \leftarrow \mathcal{X} \setminus \mathcal{X}(N)$;
 2  $x_i \leftarrow$ the most recent taxon inserted on side $s$;
 3  $L \leftarrow \{l \in \mathcal{X}' \mid l \rightarrow_{\mathcal{C}} x_i\}$;
 4  $L' \leftarrow \{l \in L \mid$ there does not exist $l' \in L$ such that $l' \neq l$ and $l \rightarrow_{\mathcal{C}} l'\}$;
 5  $U \leftarrow \{s' \mid s' \neq s$ is a side of $N$ that is not yet finished and is reachable from $s\}$;
 6  foreach $l \in L'$ do
 7      $S(l) = \bigcup \{C \in \mathcal{C} \mid x_i \in C$ and $l \notin C\}$;
 8      $B(l) = \mathcal{X}' \cap S(l)$;
 9  if $U = \emptyset$ then
10      if $|L'| \neq 1$ then declare $s$ as finished in $N$ and return $N$;
11      $l \leftarrow$ removeFirst($L'$);
12      if $B(l) \neq \emptyset$ then declare $s$ as finished in $N$ and return $N$;
13      if $N(l, s)$ does not represent $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$ then declare $s$ as finished in $N$ and return $N$;
14      else
15          return $N(l, s)$;
16  else
17      if $L' = \emptyset$ then
18          declare $s$ as finished in $N$ and return $N$;
19      if $|L'| \geq 2$ then
20          $\mathcal{N} \leftarrow N$, where $s$ is declared as finished;
21          if $|L'| \leq |U|$ then
22              $\mathcal{N}' \leftarrow$ the set of networks obtainable from $N$ by allocating all taxa in $L'$ to sides in $U$;
23              $\mathcal{N} \leftarrow \mathcal{N} \cup \mathcal{N}'$;
24          if $|L'| - 1 \leq |U|$ then
25              foreach $l \in L'$ do
26                  $\mathcal{N}' \leftarrow$ the set of networks obtainable from $N(l, s)$ by allocating all taxa in $L' \setminus \{l\}$ to sides in $U$;
27                  $\mathcal{N} \leftarrow \mathcal{N} \cup \mathcal{N}'$;
28          return $\mathcal{N}$;
29      if $|L'| = 1$ then
30          $l \leftarrow$ removeFirst($L'$);
31          $\mathcal{N} \leftarrow \emptyset$;
32          if $B(l) \neq \emptyset$ then
33              $\mathcal{N} \leftarrow N$, where $s$ is declared as finished;
34              foreach side $s' \in U$ do
35                  $\mathcal{N} \leftarrow \mathcal{N} \cup N(l, s')$;
36              if $|B(l)| \leq |U|$ then
37                  $\mathcal{N}' \leftarrow$ the set of all networks obtainable from $N(l, s)$ by allocating all taxa in $B(l)$ to sides in $U$;
38                  $\mathcal{N} \leftarrow \mathcal{N} \cup \mathcal{N}'$;
39          else
40              if $N(l, s)$ does not represent $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$ then
41                  $\mathcal{N} \leftarrow N$, where $s$ is declared as finished;
42                  foreach side $s' \in U$ do
43                      $\mathcal{N} \leftarrow \mathcal{N} \cup N(l, s')$;
44              else
45                  $D \leftarrow$ an arbitrary set of $|U|$ taxa such that $D \cap \mathcal{X} = \emptyset$;
46                  $N^*(l, s) \leftarrow$ a network obtained from $N(l, s)$ by arbitrarily and bijectively assigning each taxon in $D$ to a side in $U$;
47                  $\mathcal{C}^* \leftarrow \{C \in \mathcal{C}$ such that $x_i \in C$, $l \notin C$, and $C \subseteq \mathcal{X}(N)\}$;
48                  if $N^*(l, s)$ does not represent $\mathcal{C}^*$ then
49                      $\mathcal{N} \leftarrow N$, where $s$ is declared as finished;
50                      foreach side $s' \in U$ do
51                          $\mathcal{N} \leftarrow \mathcal{N} \cup N(l, s')$;
52                  else
53                      $\mathcal{N} \leftarrow N(l, s)$;
54          return $\mathcal{N}$;

We now consider the case when $|L'| = 1$. Let $l$ be the only taxon in $L'$. In this case, Algorithm 1 will return $N$ if $B(l) \neq \emptyset$. Indeed, no valid completion of $N$ exists where one or more taxa are placed above $x_i$ on $s$. Suppose this is wrong. In that case, observe that in every valid completion $l$ always has to be the taxon directly above $x_i$. Indeed, if there was some valid completion such that $l$ is not directly above $x_i$, then there would exist some taxon $l' \neq l$ such that $l' \rightarrow_{\mathcal{C}} x_i$ (from Observation 2) and $l \rightarrow_{\mathcal{C}} l'$ (as before, this follows from the fact that $U = \emptyset$ and from Observations 2 and 3). This would mean that $l \notin L'$, contradiction. So we assume that $l$ is directly above $x_i$. Now, since $B(l) \neq \emptyset$, there is some cluster in the input that contains $x_i$, does not contain $l$, and contains some not-yet-allocated taxon distinct from $l$. From Observation 4, the only edge that can represent such a cluster is the edge $e$ between the parents of $x_i$ and $l$. But all the clusters represented by $e$ consist only of already-allocated taxa, because $U = \emptyset$. This means that adding $l$ on side $s$ will only lead us to construct non-valid completions. Hence we conclude that, if $B(l) \neq \emptyset$, no valid completion of $N$ contains any other taxon on $s$, and ending the side $s$ is the right choice.

Now consider the case $B(l) = \emptyset$ and let $\mathcal{C}'$ be $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$. If $N(l, s)$ does not represent $\mathcal{C}'$ we are definitely correct to declare the side $s$ as finished and return $N$. Indeed, in that case no valid completion of $N$ contains any other taxon on $s$. Suppose this is not correct. Then there exists a valid completion of $N$ where at least one taxon is above $x_i$ on $s$. Again, for the same reasons as above, we may assume that $l$ is the taxon directly above $x_i$. Since $N(l, s)$ does not represent $\mathcal{C}'$, and this incompatibility cannot be eliminated by adding more taxa, we conclude that there are no valid completions of $N$ with taxa above $x_i$ on side $s$. Hence, ending the side $s$ is the only correct option. Suppose now that $N(l, s)$ does represent $\mathcal{C}'$; Algorithm 1 adds $l$ above $x_i$ on side $s$, and does not declare $s$ as finished. This conclusion can only be incorrect if all valid completions require that $l$ is not directly above $x_i$. We observe that in any valid completion of $N$ there can be no taxon $l' \neq l$ directly above $x_i$ on $s$, because otherwise, as before, since $U = \emptyset$ we will have that $l \rightarrow_{\mathcal{C}} l' \rightarrow_{\mathcal{C}} x_i$ and hence $l \notin L'$, contradiction. So all valid completions terminate the side at $x_i$. Let $N'$ be an arbitrary valid completion of $N$ and denote by $N''$ the network obtained from $N'$ by moving $l$, wherever it is, just above $x_i$. Firstly, we claim that $N''$ still represents $\mathcal{C}$. Recall that $l \rightarrow_{\mathcal{C}} x_i$, so the only potential problem is with clusters in $\mathcal{C}$ that contain $x_i$ but do not contain $l$. Let $C$ be such a cluster not represented by $N''$. Suppose $C \not\subseteq \mathcal{X}(N) \cup \{l\}$. But in this case we would have $B(l) \neq \emptyset$ in $N$, contradiction. So the only possibility is that $C \subseteq \mathcal{X}(N) \cup \{l\}$. Clearly $C$ was in $\mathcal{C}'$ and was thus represented by $N(l, s)$. Moreover, from Observation 4, the edge that represents $C$ in $N(l, s)$ is the edge between the parents of $l$ and $x_i$. Given that $U = \emptyset$, no more taxa can be added "underneath" side $s$ and this edge still represents $C$ in $N''$ because $N'$ is a valid completion of $N$. Hence moving $l$ in this way is safe in terms of cluster representation.

Secondly, we claim that moving $l$ in this way does not alter the side types, i.e. the empty/short/long sides before moving $l$ remain empty/short/long after moving $l$. To see this, note that moving $l$ from its original location reduces the number of taxa by 1 on some side, and increases the number of taxa on $s$ by 1. Side $s$ is by assumption already long, so remains so. The side of $N'$ containing $l$ cannot change from being long to being short in $N''$, because this lowers the total number of long sides, and by assumption the pair $(G, S_G)$ underlying $N$ is side-minimal. Similarly it cannot change from being short to being empty, because this leaves the number of long sides the same but reduces the number of short sides, again contradicting the assumption that $(G, S_G)$ is side-minimal. Combining these two claims (that moving $l$ is safe for cluster representation and does not alter the side types nor the underlying generator) lets us conclude that there is a valid completion for $N$ in which $l$ is placed directly above $x_i$. Hence it is correct to add $l$ above $x_i$ on side $s$ and not to declare $s$ as finished.

Case $U \neq \emptyset$. The case $|L'| = 0$ is identical to the corresponding subcase when $U = \emptyset$. This means that in this case it is always correct to declare the side $s$ as finished and return $N$.

Consider now the case $|L'| \geq 2$. Observe firstly that, if some taxon $l \in L'$ is placed directly above $x_i$, then all remaining taxa in $L'$ must be allocated to sides in $U$. To see why this is, note that for every $l' \in L'$ we have that $l' \rightarrow_{\mathcal{C}} x_i$. So, if some $l' \neq l$ is placed above $l$ on $s$ or on a side not in $U$, then, from Observations 2 and 3, we would have that $l' \rightarrow_{\mathcal{C}} l \rightarrow_{\mathcal{C}} x_i$, contradicting the fact that $l'$ is in $L'$. We only need to show that, if a valid completion for $N$ exists, then the set $\mathcal{N}$ contains a network for which there exists a valid completion. Note that $\mathcal{N}$ contains (line 20) $N$, where $s$ is declared as finished, (lines 21-23) all possible networks obtained from $N$ by allocating all taxa in $L'$ to sides in $U$ and (lines 24-27) all possible networks obtained from $N(l, s)$ by allocating all taxa in $L' \setminus \{l\}$ to sides in $U$, iterating over all $l \in L'$. The only case that these three sets do not describe is when every valid completion has a taxon $p \notin L'$ directly above $x_i$, but at least one taxon $l \in L'$ is not mapped to $U$. But this implies, similarly to the case $|U| = 0$, that $l \rightarrow_{\mathcal{C}} p \rightarrow_{\mathcal{C}} x_i$, so $l \notin L'$, contradiction. Hence this case cannot happen, and the three sets actually describe all possible outcomes in this situation. So at least one of them will contain a network with a valid completion in the case that $N$ does have a valid completion.

Consider now the case $|L'| = 1$. We begin with the subcase $B(l) \neq \emptyset$. Similar to previous arguments we know that, if we place $l$ (the only element in $L'$) directly above $x_i$, all taxa in $B(l)$ have to be allocated to $U$. This holds because, from Observation 4, any cluster that contains $x_i$ but not $l$ is represented by the edge between the parents of $l$ and $x_i$. If $B(l) \neq \emptyset$, the set $\mathcal{N}$ is composed of (line 33) $N$, where $s$ is declared as finished, (lines 34-35) all possible networks obtained from $N$ by allocating $l$ to a side in $U$ and (lines 36-38) all possible networks obtained from $N(l, s)$ by allocating all taxa in $B(l)$ to sides in $U$. Observe that the only situation that these three guesses do not describe is when some taxon $p \neq l$ is placed above $x_i$ and $l$ is not mapped to $U$. But in this case we would have that $l \rightarrow_{\mathcal{C}} p \rightarrow_{\mathcal{C}} x_i$, contradicting the fact that $l$ is in $L'$. So $\mathcal{N}$ does again describe all possible outcomes.

This leaves us with the very last subcase, $|L'| = 1$ and $B(l) = \emptyset$. The subcase when $N(l, s)$ does not represent $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$ is actually fairly straightforward. It is clear that $l$ cannot be placed in this position in a valid completion. Hence the only situation that line 41 and lines 42-43 do not describe is when some element $p \neq l$ is placed directly above $x_i$, and $l$ is not mapped to $U$. But, as before, this implies that $l \rightarrow_{\mathcal{C}} p \rightarrow_{\mathcal{C}} x_i$, which as we have seen is not possible. So the only remaining subcase is when $|L'| = 1$, $B(l) = \emptyset$ and $N(l, s)$ does represent $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$. Now, consider the network $N^*(l, s)$. Informally the dummy taxa in $N^*(l, s)$ act as "placeholders" for taxa that will only later in the algorithm be mapped to $U$. We do not know exactly what these taxa will be, but we know that they will definitely be there. Consider a cluster $C \in \mathcal{C}^*$. If $N^*(l, s)$ does not represent $C$ then this must be because of the dummy taxa, because we know that $N(l, s)$ did represent $\mathcal{C}$ restricted to $\mathcal{X}(N) \cup \{l\}$. Note that this holds irrespective of the true identity of the dummy taxa. Hence, $C$ will never be represented by any completion of $N(l, s)$. For this reason we conclude that, if $N^*(l, s)$ does not represent $\mathcal{C}^*$, it is definitely correct to declare the side $s$ finished (line 49) or allocate $l$ to a side in $U$ (lines 50-51).

Finally, suppose $N^*(l, s)$ does represent $\mathcal{C}^*$. This is the flip-side of the previous argument. Whatever the true identity of the dummy taxa, every valid completion of $N(l, s)$ will represent every cluster in $\mathcal{C}^*$. Let $N'$ be an arbitrary valid completion of $N$ and denote by $N''$ the network obtained from $N'$ by moving $l$, wherever it is, just above $x_i$. Now, as we did earlier, we argue that in this case it is "safe" to put $l$ directly above $x_i$. Indeed, because $B(l) = \emptyset$, the only clusters that might not be represented in $N''$ are clusters in $\mathcal{C}^*$. But we have shown that when $l$ is placed directly above $x_i$ all the clusters in $\mathcal{C}^*$ are represented regardless of how we complete the rest of the network. Secondly, we argue just as before that moving $l$ in this way cannot alter the side types. So if we choose $l$ as $x_{i+1}$ there must still exist a valid completion. This concludes the proof of the lemma. □
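
To make the quantities used throughout this proof concrete, the sketch below computes the sets $L$, $L'$ and $B(l)$ from the first lines of Algorithm 1. It assumes that $a \rightarrow_{\mathcal{C}} b$ means "every non-singleton cluster containing $a$ also contains $b$", which is our reading of the relation defined earlier in the paper; the function names and the flat set-based representation are illustrative only.

```python
def implies(clusters, a, b):
    """a ->_C b. ASSUMPTION: every non-singleton cluster containing a also contains b."""
    return all(b in c for c in clusters if len(c) > 1 and a in c)

def candidate_sets(taxa, clusters, allocated, x_i):
    """Return (L, L', B) for the most recently placed taxon x_i.

    taxa: all taxa; allocated: the taxa X(N) already in the incomplete network.
    """
    unallocated = set(taxa) - set(allocated)            # X' = X \ X(N)
    L = {l for l in unallocated if implies(clusters, l, x_i)}
    # L' keeps the taxa of L that do not imply any other taxon of L.
    L_prime = {l for l in L
               if not any(lp != l and implies(clusters, l, lp) for lp in L)}
    B = {}
    for l in L_prime:
        # S(l): union of all clusters containing x_i but not l.
        S_l = set().union(*[c for c in clusters if x_i in c and l not in c])
        B[l] = unallocated & S_l                        # B(l) = X' ∩ S(l)
    return L, L_prime, B
```

With clusters {a, b}, {a, c}, {a, b, c} and only a allocated, both b and c imply a, neither implies the other, and each blocks the other: B(b) = {c} and B(c) = {b}.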

Lemma 4 Let $\mathcal{C}$ be a separating set of clusters on $\mathcal{X}$ and let $k$ be the first integer for which a level-$k$ network representing $\mathcal{C}$ exists. Let $N$ be an incomplete network such that its underlying generator $G$ and set of side guesses $S_G$ are such that $(G, S_G)$ is side-minimal w.r.t. $\mathcal{C}$ and $k$, and let $s$ be an active side of $N$ that contains only a single taxon. Then, if a valid completion for $N$ exists, Algorithm 2 computes in $f(k) \cdot poly(n)$ time a set of (incomplete) networks $\mathcal{N}$ for which $s$ is a finished side, such that $\mathcal{N}$ contains at least one network for which there exists a valid completion.

Proof. The correctness follows from Lemma 3. We now prove the running time.

First, note that the size of the set $\mathcal{N}$ returned by Algorithm 1 is bounded by $f(|U|)$. This is evident for the sets $\mathcal{N}$ constructed on lines 34-35, 42-43 and 50-51, but it holds also for the sets $\mathcal{N}'$ constructed respectively on lines 21-23, 24-27 and 36-38, since these sets are constructed only if, respectively, $|L'| \leq |U|$, $|L'| - 1 \leq |U|$ or $|B(l)| \leq |U|$. Since in all other cases $|\mathcal{N}| = 1$, the size of the set $\mathcal{N}$ returned by Algorithm 1 is indeed bounded by $f(|U|)$. Moreover, from Proposition 1 it follows that the running time of Algorithm 1 is $f(k) \cdot poly(n)$.

Second, note that each time Algorithm 1 returns a set of networks $\mathcal{N}$ such that $|\mathcal{N}| > 1$, $|U|$ decreases or $s$ is declared as finished. Additionally, when $U = \emptyset$ and $|\mathcal{N}| = 1$, Algorithm 1 returns only one network per call and we have at most $O(n)$ of these calls (because either $s$ is declared finished or a new taxon is added to $s$).

Since the number of sides in a generator is bounded by $f(k)$ and $U$ is a subset of the short sides of the generator (which follows from the fact that all long sides reachable from $s$ are assumed to be finished), we have that $|U|$ is bounded by $f(k)$. Thus, the running time of Algorithm 2 is bounded by $f(k) \cdot poly(n)$. □



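Algorithm 2: completeSide($N$, $s$)

1  $\mathcal{N} \leftarrow N$;
2  while there exists $N \in \mathcal{N}$ such that $s$ is not finished in $N$ do
3      $\mathcal{N} \leftarrow \mathcal{N} \setminus N$;
4      $\mathcal{N} \leftarrow \mathcal{N} \cup$ addOnSide($N$, $s$);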
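
The control flow of Algorithm 2 is a plain worklist loop. The following minimal sketch shows only that loop: `add_on_side` and `is_finished` are hypothetical callbacks standing in for Algorithm 1 and for the "side $s$ is finished" test, and networks are treated as opaque values.

```python
def complete_side(network, s, add_on_side, is_finished):
    """Expand networks with add_on_side until side s is finished in all of them."""
    pending = [network]   # networks in which s may still be unfinished
    done = []             # networks in which s has been declared finished
    while pending:
        current = pending.pop()
        if is_finished(current, s):
            done.append(current)
        else:
            # add_on_side returns an iterable of successor networks.
            pending.extend(add_on_side(current, s))
    return done
```

Termination of this loop is not automatic: it relies on the callback eventually declaring the side finished, which is what the running time argument in Lemma 4 establishes for Algorithm 1.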



We will subsequently use the term lowest side to denote a long side $s$ that does not yet have all its taxa (i.e. an unfinished long side), and such that there is no other long side $s' \neq s$ with this property that is reachable from $s$.
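
As an illustration of this definition, the sketch below finds the lowest sides given a reachability relation between sides. The inputs are hypothetical simplifications: `graph` maps each side to the sides directly below it, `kind` records the guessed type of each side, and `finished` is the set of sides already declared finished.

```python
def reachable(graph, start):
    """All sides reachable from `start` by following `graph` edges (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)
    return seen

def lowest_sides(graph, kind, finished):
    """Unfinished long sides from which no other unfinished long side is reachable."""
    unfinished_long = {s for s in kind if kind[s] == "long" and s not in finished}
    return {s for s in unfinished_long
            if not (reachable(graph, s) & unfinished_long)}
```

Processing lowest sides first is what guarantees the precondition of Algorithm 1, namely that every long side reachable from the active side is already finished.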

The following lemma is basically the fixed parameter tractable version of Lemma 3 from [19]:

Lemma 5 Let $\mathcal{C}$ be a separating set of clusters on $\mathcal{X}$. Then, for every fixed $k \geq 0$, Algorithm 3 determines whether a level-$k$ network exists that represents $\mathcal{C}$, and if so, constructs such a network in time $f(k) \cdot poly(n)$.

Proof. Algorithm 3 starts by choosing a level-$k$ generator $G$ and a set of side guesses $S_G$. Note that (see lines 1-2) generators and sets of guesses are analyzed in such a way that generators with a smaller number of sides and sets of side guesses with a smaller number of long sides, and (to further break ties) short sides, are analyzed first (this is the meaning of the expression "in increasing side order"). This implies that the side-minimal pair $(G, S_G)$, if any exists, is analyzed before any other pair $(G', S_{G'})$ for which a valid completion exists. This is done to be able to apply Lemma 4.
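
The "increasing side order" can be sketched as a sort key. In the sketch below a generator is reduced to its number of sides and a side guess to a tuple of labels; both are hypothetical simplifications, and any validity constraints the paper places on side guesses are ignored.

```python
from itertools import product

def side_guesses_in_order(num_sides):
    """All labelings of the sides, fewest long sides first, then fewest short sides."""
    guesses = product(("empty", "short", "long"), repeat=num_sides)
    return sorted(guesses, key=lambda g: (g.count("long"), g.count("short")))

def pairs_in_order(generators):
    """Yield (generator name, side guess) pairs in increasing side order.

    generators: dict mapping a (hypothetical) generator name to its number of sides.
    """
    for name in sorted(generators, key=generators.get):
        for guess in side_guesses_in_order(generators[name]):
            yield name, guess
```

Enumerating in this order means that the first pair admitting a valid completion is a side-minimal one, which is the hypothesis needed by Lemmas 3 and 4.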

Then (see lines 4-18), the algorithm constructs a set of complete networks, i.e. simple level-$k$ networks where each short side has received a single taxon and each long side at least two, and returns the first of them that represents $\mathcal{C}$, if any exists. On line 6, $\mathcal{X}(s^-)$ denotes the set of all taxa in $\mathcal{X}$ that are candidates to be the first taxon on side $s$, which we call $s^-$. We will discuss this set in more detail shortly.

Algorithm 3: ComputeLevel-$k$($\mathcal{C}$)

Data: A separating set of clusters $\mathcal{C}$ on $\mathcal{X}$
Result: A level-$k$ network representing $\mathcal{C}$, if any exists

 1  foreach level-$k$ generator $G$ in increasing side order do
 2      foreach set of side guesses $S_G$ in increasing side order do
 3          [...]
 4          while there exists $N \in \mathcal{N}$ such that $N$ contains a lowest side $s$ do
 5              [...]
 6              foreach $s^- \in \mathcal{X}(s^-)$ do
 7                  $\mathcal{N}' \leftarrow$ completeSide($N(s^-, s)$, $s$);
 8                  $\mathcal{N}'' \leftarrow$ the networks in $\mathcal{N}'$ where $s$ contains more than one taxon;
 9                  foreach $N \in \mathcal{N}''$ do
10                      collapse all taxa on side $s$ into a single meta-taxon $S$ and adjust the cluster set accordingly;
11                  $\mathcal{N} \leftarrow \mathcal{N} \cup \mathcal{N}''$;
12          if $|\mathcal{N}| > 0$ then
13              while there exists $N \in \mathcal{N}$ do
14                  de-collapse the collapsed sides;
15                  $\mathcal{N}' \leftarrow$ the set of networks obtainable from $N$ by allocating the taxa in $\mathcal{X} \setminus \mathcal{X}(N)$ to any short sides that have not yet been allocated a taxon;
16                  if there is a network $N' \in \mathcal{N}'$ representing $\mathcal{C}$ then
17                      return $N'$;

Note that lines 10 and 14 of the algorithm are only a technical step. Indeed, when we declare a side $s$ as finished, we assume that we will never alter that side again. Hence it does not change the analysis if we collapse all the taxa on side $s$ into a single meta-taxon. That is, if we have decided that the taxa on the side $s$ are $s^-, x_1, \ldots, x_l$, we simply replace all these taxa by a single new taxon $S$ and replace $s^-, x_1, \ldots, x_l$ by $S$ in any clusters in $\mathcal{C}$ that they appear in (line 10). This collapsing step is simply a convenience to ensure that the sides reachable from the current lowest side are always empty or short sides. This will be helpful when proving the running time of Algorithm 3, see below. When we are finished allocating all the taxa in $\mathcal{X}$ and are ready to check whether the resulting final network represents $\mathcal{C}$, we can simply de-collapse all the $S$, i.e. "unfold" all the long sides that we have collapsed (line 14). This means that the correctness of Algorithm 3 follows by Observation 1 and Lemma 4.
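
The collapsing and de-collapsing steps (lines 10 and 14 of Algorithm 3) amount to a substitution on taxon names. The sketch below is a simplified illustration: clusters are sets of names, a side is a bottom-to-top list of names, and the helper names are ours rather than the paper's.

```python
def collapse_side(clusters, side_taxa, meta):
    """Replace the taxa of a finished side by the single meta-taxon `meta` in every cluster."""
    side = frozenset(side_taxa)
    return [frozenset((c - side) | {meta}) if c & side else c
            for c in map(frozenset, clusters)]

def decollapse_side(side_order, collapsed):
    """Expand meta-taxa back into the taxa they stand for.

    side_order: bottom-to-top list of taxa on a side.
    collapsed: dict mapping each meta-taxon to its original bottom-to-top taxa.
    """
    out = []
    for taxon in side_order:
        out.extend(collapsed.get(taxon, [taxon]))
    return out
```

Collapsing can make two clusters coincide; the list returned here keeps such duplicates, and a caller that needs a proper cluster set can wrap the result in `set(...)`.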

We now need to prove the running time bound. First, note that the number of pairs $(G, S_G)$ to consider is bounded by $f(k)$, since both the number of generators and the number of sides per generator are bounded by $f(k)$.

We now need to prove that the size of $\mathcal{X}(s^-)$ is at most $f(k)$ for all sides $s$, i.e. that the number of taxa that might be the first taxon $s^-$ on side $s$ is not too big. So let $s^-$ be any taxon which can fulfil this role, and let $x$ be the taxon directly above $s^-$ on side $s$. (The taxon $x$ must exist because we assume that $s$ is long.) Clearly, $x \rightarrow_{\mathcal{C}} s^-$. By line 10, we have that the only sides reachable from side $s$ are short and empty sides. Moreover, we know from Observation 5 that, because $\mathcal{C}$ is separating, there is some non-singleton cluster $C \in \mathcal{C}$ such that $s^- \in C$ but $x \notin C$. By Observation 4, such a cluster $C$ has to be represented by the edge $e$ between the parents of $x$ and $s^-$. Now, any cluster represented by $e$ can only contain taxa that are reachable from $e$ by a directed path. The only sides that are reachable from side $s$ are short and empty sides, so the cluster $C$ can only contain at most $f(k)$ taxa (because there are at most $f(k)$ short sides). So we know that $s^-$ is in some cluster $C$, and that $C$ is "small" in the sense that its size is bounded above by $f(k)$. So if we take all "small" clusters, and let $\mathcal{X}(s^-)$ be their union, we know that we could simply try taking every element in $\mathcal{X}(s^-)$ and guessing that it is equal to $s^-$. To ensure that we do not use too many guesses, we have to show that $|\mathcal{X}(s^-)|$ is bounded by $f(k)$. To see that this holds, consider the question: how many clusters in $\mathcal{C}$ contain at most $c$ taxa for a constant $c$? Observe that on every long side only the $c$ taxa furthest away from the root are potentially in such clusters. Any taxon closer to the root on a long side cannot possibly be in a cluster of size at most $c$, because if it is in a cluster then so are at least $c$ other taxa too. Hence there are at most $f(k)$ taxa that can be involved in "small" clusters: the taxa on the short sides and the taxa at the bottom of the long sides. So we have that $|\mathcal{X}(s^-)|$ is bounded by $f(k)$ and we can guess $s^-$ with at most $f(k)$ guesses.
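
The candidate set described above is simply the union of the "small" clusters. A minimal sketch, assuming that only non-singleton clusters are considered (as in the argument above) and taking the size bound as an explicit parameter `bound` in place of $f(k)$:

```python
def first_taxon_candidates(clusters, bound):
    """Union of all non-singleton clusters containing at most `bound` taxa."""
    candidates = set()
    for cluster in clusters:
        if 1 < len(cluster) <= bound:
            candidates |= set(cluster)
    return candidates
```

Each element of the returned set is then tried in turn as the first taxon $s^-$ of the current lowest side.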

The collapsing and de-collapsing steps (lines 10 and 14) can be done in $f(k) \cdot poly(n)$ time, as can completing each side $s$ (line 7), by Lemma 4. Moreover, by Proposition 1, checking whether $N$ represents $\mathcal{C}$ also takes $f(k) \cdot poly(n)$ time. Additionally, the allocation of remaining taxa to the unfinished short sides (line 15) takes a time bounded by $f(k)$. Indeed, if we have correctly guessed all the taxa on the long sides, then there will be at most $f(k)$ taxa left that have not yet been assigned to a side, and these will correspond to taxa on (a subset of) the short sides. (Some of the short sides might already have been allocated taxa by Algorithm 1.)

This means that, if the size of $\mathcal{N}$ is bounded by $f(k)$, then the entire algorithm can be executed in $f(k) \cdot poly(n)$ time. And this is indeed the case, since $|\mathcal{X}(s^-)|$ and (summing over all iterations) the total number of lowest sides are bounded by $f(k)$ and each time that a side is completed, the number of unfinished long sides decreases by 1. □



3.2 From simple networks to general networks 

To prove the fixed parameter tractability of constructing general level-$k$ networks, we need to introduce a few other concepts. The most important is the concept of a decomposable network.

Definition 2 Let $\mathcal{C}$ be a set of clusters on a taxon set $\mathcal{X}$ with incompatibility graph $IG(\mathcal{C})$. A phylogenetic network $N$ that represents $\mathcal{C}$ is said to be decomposable w.r.t. $\mathcal{C}$ if and only if there exists a cluster-to-edge mapping $\alpha : \mathcal{C} \rightarrow E(N)$ such that, for any two clusters $C_1, C_2 \in \mathcal{C}$, $C_1$ and $C_2$ lie in the same connected component of $IG(\mathcal{C})$ if and only if the two tree edges $\alpha(C_1)$ and $\alpha(C_2)$ that represent $C_1$ and $C_2$ are contained in the same biconnected component of $N$.

The following observation is straightforward: 

Observation 6. Let $\mathcal{C}$ be a set of clusters on a taxon set $\mathcal{X}$ with incompatibility graph $IG(\mathcal{C})$ and let $N$ be a phylogenetic network representing $\mathcal{C}$ that is decomposable w.r.t. $\mathcal{C}$. Then the number of biconnected components of $N$ is equal to the number of connected components of $IG(\mathcal{C})$.
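
The connected components of the incompatibility graph, and the union of the taxa in each component (the backbone clusters introduced next), can be computed directly from the cluster set. The following is a minimal sketch, assuming that the incompatibility graph has one node per cluster and an edge between every two incompatible clusters; the function names are illustrative.

```python
def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def incompatibility_components(clusters):
    """Connected components of the incompatibility graph, as lists of clusters."""
    clusters = [frozenset(c) for c in clusters]
    unvisited = set(range(len(clusters)))
    components = []
    while unvisited:
        start = unvisited.pop()
        comp, stack = [start], [start]
        while stack:
            i = stack.pop()
            linked = {j for j in unvisited
                      if not compatible(clusters[i], clusters[j])}
            unvisited -= linked
            comp.extend(linked)
            stack.extend(linked)
        components.append([clusters[i] for i in comp])
    return components

def backbone_clusters(clusters):
    """One backbone cluster per component: the union of the clusters in it."""
    return {frozenset().union(*comp)
            for comp in incompatibility_components(clusters)}
```

By Observation 6, for a decomposable network the number of components returned here equals the number of biconnected components of the network.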

Let $\mathcal{C}$ be a set of clusters on $\mathcal{X}$ with incompatibility graph $IG(\mathcal{C})$. The set of backbone clusters associated with $\mathcal{C}$ is defined as

$$B(\mathcal{C}) = \{\mathcal{X}(\mathcal{C}') \mid \mathcal{C}' \text{ is a connected component of } IG(\mathcal{C})\},$$

where $\mathcal{X}(\mathcal{C}') = \bigcup_{C \in \mathcal{C}'} C$ denotes the set of all taxa in $\mathcal{C}'$. Since the set $B(\mathcal{C})$ is compatible [M], we have the following result:

Proposition 3 Given a decomposable level-$k$ network $N$ representing a set of clusters $\mathcal{C}$ on $\mathcal{X}$, $N$ can contain at most $2^{k+3} \cdot (n-1)^2$ clusters.

Proof. The fact that $B(\mathcal{C})$ is compatible ensures that the size of $B(\mathcal{C})$ is at most $2(n-1)$. In the following we will prove that the number of connected components of $IG(\mathcal{C})$ is at most $4(n-1)$. To prove this, we show that it is impossible to have two non-trivial connected components of $IG(\mathcal{C})$ (i.e. two connected components containing more than one cluster each), say $\mathcal{C}'$ and $\mathcal{C}''$, such that $\mathcal{X}(\mathcal{C}') = \mathcal{X}(\mathcal{C}'')$. For the sake of contradiction, let us suppose that two such components $\mathcal{C}'$ and $\mathcal{C}''$ exist. Let $C_1 \in \mathcal{C}'$ and $C'_1 \in \mathcal{C}''$ be two clusters such that $C_1 \cap C'_1 \neq \emptyset$. Since $C_1$ and $C'_1$ are compatible, we can suppose w.l.o.g. that $C'_1 \subset C_1$. Let $C'_2$ be another cluster of $\mathcal{C}''$ incompatible with $C'_1$ ($C'_2$ exists because $\mathcal{C}''$ is not trivial). Then, since $C'_1 \cap C'_2 \neq \emptyset$, we have that $C_1 \cap C'_2 \neq \emptyset$. But $C'_2$ cannot be a superset of $C_1$, so we have that $C'_2 \subset C_1$. Reiterating this reasoning we obtain that $\mathcal{X}(\mathcal{C}'') \subseteq C_1$. Since $\mathcal{C}'$ is not trivial, there exists another cluster $C_2$ in $\mathcal{C}'$ that is incompatible with $C_1$. So there exists at least one taxon in $\mathcal{X}(\mathcal{C}')$ that is not in $C_1$ and we cannot have that $\mathcal{X}(\mathcal{C}') = \mathcal{X}(\mathcal{C}'')$, contradiction. This means that each non-singleton backbone cluster can correspond to two connected components, one trivial and one not. Then we have at most $4(n-1)$ connected components in $IG(\mathcal{C})$. By Observation 6, this implies that $N$ has at most $4(n-1)$ biconnected components.

We now prove that each biconnected component $B$ of $N$ can represent at most $2^{k+1}(n-1)$ clusters. To see that, let us denote by $V'$ the set of nodes of $N$ that are not in $B$ but whose parents are in $B$ and, for each $v \in V'$, denote by $\mathcal{X}(v)$ the set of all leaves in $N$ that are reachable by directed paths from $v$. It is easy to see that the set $V'$ has a particular property: for each node $v \in V'$ we have that, no matter which switching $T_N$ is chosen, there exists a path in the switching between $v$ and each taxon $u \in \mathcal{X}(v)$. Indeed, if this was not true, $v$ would have to be in $B$, a contradiction. Then the network $N$ can be modified in the following way: for each node $v \in V'$, label it with the set $\mathcal{X}(v)$ and delete all the outgoing edges of $v$. Let $N'$ be the resulting rooted phylogenetic network, rooted at the root of the biconnected component $B$. Because of this property of the nodes in $V'$, $N'$ represents the same cluster set as $B$ in $N$. With a line of reasoning similar to that used in Proposition 1 it is easy to see that $B$ can represent at most $2^{k+1}(n-1)$ clusters in $N'$, and thus also in $N$. This concludes the proof. □
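
For reference, the two counts established in this proof combine into the bound of Proposition 3 as follows. This is only a worked restatement of the quantities used in the proof above.

```latex
\[
\underbrace{4(n-1)}_{\text{biconnected components}}
\cdot
\underbrace{2^{k+1}(n-1)}_{\text{clusters per component}}
= 2^{2} \cdot 2^{k+1} \cdot (n-1)^{2}
= 2^{k+3}\,(n-1)^{2}.
\]
```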

The following theorem ensures that we can focus on decomposable level-$k$ networks:

Theorem 1 ( [30] ) Let C be a set of clusters. If there exists a level-k network 
representing C , then there also exists such a network that is decomposable w.r.t. 
C. 

We can now prove the main result of the section: 

Theorem 2 Let $\mathcal{C}$ be a set of clusters on $\mathcal{X}$. Then, for every fixed $k \geq 0$, it is possible to determine in time $f(k) \cdot poly(n)$ whether a level-$k$ network exists that represents $\mathcal{C}$, and if so to construct such a network.

Proof. By Theorem 1 we know that we can construct a decomposable level-$k$ network using a divide-and-conquer strategy. A possible approach is described in Section 8.2 of [13]. This approach divides $\mathcal{C}$ into $g$ subsets, where $g$ is the number of connected components of the incompatibility graph. Then, each subset $\mathcal{C}_i$ is made separating w.r.t. $\mathcal{X}(\mathcal{C}_i)$ by merging every subset of $\mathcal{X}(\mathcal{C}_i)$ that is compatible with $\mathcal{C}_i$ (see [14] for more details). Then a local network is computed for each $\mathcal{C}_i$ and finally all the networks are merged together in a global level-$k$ network representing $\mathcal{C}$. Since by Proposition 3 we can only have $f(k) \cdot poly(n)$ clusters and $poly(n)$ connected components in $IG(\mathcal{C})$, it is easy to see that the merging of the taxa in each $\mathcal{C}_i$ and the merging of all partial networks into the global one can be conducted in $f(k) \cdot poly(n)$ time. Moreover, since each subproblem $\mathcal{C}_i$ is separating, from Lemma 5 we have that constructing each local network takes $f(k) \cdot poly(n)$ time. This concludes the proof. □

4 Minimizing reticulation number is fixed parameter tractable 

The aim of this section is to show that reticulation number minimization is fixed parameter tractable. As pointed out for level minimization at the beginning of Section 3, it is sufficient to prove that, given a set of clusters $\mathcal{C}$ on taxon set $\mathcal{X}$, we can construct a phylogenetic network representing $\mathcal{C}$ with reticulation number $r$ (if any exists) in time at most $f(r) \cdot poly(n)$.

To show the main result of this section we will introduce the concepts of ST-collapsed cluster sets and of $r$-reticulation generators. We will then prove that all the results and algorithms used in the previous section to prove that constructing simple level-$k$ networks is fixed parameter tractable hold not only for separating cluster sets and level-$k$ generators but also for ST-collapsed cluster sets and $r$-reticulation generators. The main difficulty is to show that several key utility results still hold, since the other results do not exploit the biconnectedness of simple level-$k$ generators. For ease of reading we will refer to the extended versions of these results using their original name followed by the term "(extended)" (e.g. "Proposition 1" becomes "Proposition 1 (extended)").

Observation 7. Let $N$ be a network on $\mathcal{X}$ with reticulation number $r$. Then $N$ represents at most $2^{r+1}(n-1)$ clusters.

Proof. From Lemma 1 we may assume without loss of generality that $N$ is binary. A binary network with reticulation number $r$ contains exactly $r$ reticulation nodes. Hence $N$ displays at most $2^r$ trees, and each tree represents at most $2(n-1)$ clusters (because a rooted tree on $n$ taxa contains at most $2(n-1)$ edges). □
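
Spelled out, the count in this proof is the product of the number of displayed trees and the number of clusters per tree. The numerical instance ($r = 3$, $n = 10$) is our own illustration and does not come from the paper.

```latex
\[
\underbrace{2^{r}}_{\text{trees displayed}}
\cdot
\underbrace{2(n-1)}_{\text{clusters per tree}}
= 2^{r+1}(n-1),
\qquad
\text{e.g. } r = 3,\ n = 10:\quad 2^{3} \cdot 2 \cdot 9 = 144 = 2^{4} \cdot 9 .
\]
```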

We can thus henceforth assume that \C\ < 2^^^{n — 1) i.e. that C contains 
at most f{r) ■ poly{n) clusters. Then we have that the following holds: 

Proposition 1 (extended). Let N be a network on X with reticulation number
at most r. Then, given a cluster set C, we can check in time $f(r) \cdot \mathrm{poly}(n)$
whether N represents C.
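The counting argument behind Observation 7 and Proposition 1 (extended) can be illustrated by brute force. The sketch below is only an illustration and not the procedure used in this article: it enumerates every switching of a small network (one kept incoming edge per reticulation) and collects the clusters found below the remaining edges. The function names, the edge-list encoding and the example network are assumptions made for this sketch.

```python
from itertools import product

def softwired_clusters(edges, taxa):
    """Collect all clusters represented (softwired sense) by a network.

    edges: list of (parent, child) pairs; taxa: set of leaf labels.
    Assumption of this sketch: no parallel edges, and clusters are read
    off edges only, mirroring the edge count used in Observation 7.
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]

    clusters = set()
    # A switching keeps exactly one incoming edge per reticulation, so a
    # binary network with r reticulations has at most 2^r switchings.
    for choice in product(*(parents[v] for v in reticulations)):
        kept_parent = dict(zip(reticulations, choice))
        on = [(u, v) for u, v in edges if kept_parent.get(v, u) == u]
        children = {}
        for u, v in on:
            children.setdefault(u, []).append(v)

        def below(node):
            # Taxa reachable from `node` using switched-on edges only.
            if node in taxa:
                return frozenset([node])
            return frozenset().union(*(below(c) for c in children.get(node, [])))

        clusters.update(below(v) for _, v in on)
    clusters.discard(frozenset())
    return clusters

def represents(edges, taxa, cluster_set):
    """True if every cluster in cluster_set is represented by the network."""
    return {frozenset(c) for c in cluster_set} <= softwired_clusters(edges, taxa)

# Hypothetical binary network with one reticulation h on taxa a, b, c.
edges = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "h"),
         ("q", "h"), ("q", "c"), ("h", "b")]
taxa = {"a", "b", "c"}
found = softwired_clusters(edges, taxa)
r, n = 1, len(taxa)
assert len(found) <= 2 ** (r + 1) * (n - 1)  # bound of Observation 7
```

The enumeration takes time exponential in r only, which is the shape of bound that Proposition 1 (extended) asks for.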

Given a set of taxa $S \subseteq X$, we use $C \setminus S$ to denote the result of removing
all elements of S from each cluster in C and we use $C|S$ to denote $C \setminus (X \setminus S)$
(i.e. the restriction of C to S). We say that a set $S \subseteq X$ is an ST-set with
respect to C, if S is compatible with C and any two clusters $C_1, C_2 \in C|S$ are
compatible [19]. (We say that an ST-set S is trivial if $S = \emptyset$ or $S = X$.) An
ST-set S is maximal if there is no ST-set S' with $S \subset S'$. The following results
from [19] will be very useful:

Corollary 2 [19]. Let C be a set of clusters on X. Then there are at most n
maximal ST-sets with respect to C, they are uniquely defined and they partition
X.

Lemma 6 [19]. The maximal ST-sets of a set of clusters C on X can be
computed in polynomial time.

Here "polynomial time" means $\mathrm{poly}(n, |C|)$, but given that the size of C
is at most $f(r) \cdot \mathrm{poly}(n)$ it follows that the maximal ST-sets of C can all be
computed in time $f(r) \cdot \mathrm{poly}(n)$.

The following corollary says, essentially, that if we want to construct
networks with minimum reticulation number then it is safe to assume that each
maximal ST-set corresponds to (the taxa in) a subtree that is attached to
the main network via a cut-edge.

Corollary 3 [19]. Let N be a network that represents a set of clusters C. There
exists a network N' such that N' represents C, $r(N') \leq r(N)$, $\ell(N') \leq \ell(N)$
and all maximal ST-sets (with respect to C) are below cut-edges.

Let $S = \{S_1, \ldots, S_m\}$ be the set of maximal ST-sets of C. We construct
a new cluster set C' from C as follows. For each $S_j \in S$, and for each cluster
in C that has a non-empty intersection with $S_j$, we replace that intersection
in the cluster by the new taxon $S_j$. In other words we "collapse" all taxa in
each maximal ST-set into a single new taxon that represents that ST-set. We
say that C' is the ST-collapsed version of C. Note that a separating cluster set C
is necessarily ST-collapsed but the opposite implication does not hold. For example
$C = \{\{a, b\}, \{b, c\}, \{a, b, c, d\}, \{d, e\}\}$ on $X = \{a, b, c, d, e\}$ is ST-collapsed but
not separating because $\{a, b, c\}$ is compatible with C (it is not an ST-set,
however, since $\{a, b\}$ and $\{b, c\}$ are incompatible).
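The definitions of compatibility, ST-set and ST-collapsing can be checked on small inputs with a brute-force sketch. It is not the polynomial-time procedure of [19]: it simply tries every subset of X, so it is exponential in |X|. All function names, and the convention of naming a collapsed taxon by joining its members, are assumptions made for this sketch.

```python
from itertools import combinations

def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a

def restrict(clusters, s):
    """C|S: intersect every cluster with S, dropping empty results."""
    return {c & s for c in clusters if c & s}

def is_st_set(s, clusters):
    """S is an ST-set if it is compatible with C and C|S is pairwise compatible."""
    if not all(compatible(s, c) for c in clusters):
        return False
    return all(compatible(c1, c2)
               for c1, c2 in combinations(restrict(clusters, s), 2))

def maximal_st_sets(taxa, clusters):
    """Brute force over all non-empty subsets of the taxon set."""
    st = [frozenset(s)
          for k in range(1, len(taxa) + 1)
          for s in combinations(sorted(taxa), k)
          if is_st_set(frozenset(s), clusters)]
    return [s for s in st if not any(s < t for t in st)]

def st_collapse(taxa, clusters):
    """Replace each maximal ST-set by one new taxon named after its members."""
    name = {x: "".join(sorted(s))
            for s in maximal_st_sets(taxa, clusters) for x in s}
    return {frozenset(name[x] for x in c) for c in clusters}

# The example from the text: {a, b, c} is compatible with C but is not an ST-set.
X = set("abcde")
C = [frozenset(c) for c in ("ab", "bc", "abcd", "de")]
abc = frozenset("abc")
assert all(compatible(abc, c) for c in C)
assert not is_st_set(abc, C)
# Every maximal ST-set is a single taxon, so collapsing leaves C unchanged.
assert st_collapse(X, C) == set(C)
```

For a cluster set that is not ST-collapsed, such as {{a, b}, {a, b, c}, {c, d}} on {a, b, c, d}, the same functions report {a, b} as a maximal ST-set and replace it by the single taxon "ab".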

Observation 5 (extended). Let C be an ST-collapsed cluster set on X. Then
every size-2 subset of X is incompatible with C.

Proof. Suppose that this is not true and there exists a size-2 subset of X, say A,
that is compatible with C. Since any two clusters $C_1, C_2 \in C|A$ are necessarily
compatible, A is an ST-set, contradicting the fact that C is ST-collapsed. □

Lemma 7 Let C be a cluster set on X, and let C' be the ST-collapsed version of
C. Then any network N' that represents C' can be transformed into a network
N that represents C, such that $r(N) = r(N')$, in $f(r(N)) \cdot \mathrm{poly}(|X|)$ time.

Proof. Let $S = \{S_1, \ldots, S_m\}$ be the set of maximal ST-sets of C. For each
$S_j \in S$ we replace the taxon $S_j$ in N' with the tree on taxon set $S_j$ that
represents exactly the set of clusters $C|S_j$. □
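The replacement step in this proof can be sketched as follows. Because the clusters restricted to an ST-set are pairwise compatible, they form a tree in which each cluster hangs below the smallest cluster that strictly contains it. The helper names, the edge-list encoding and the example network are assumptions made for this sketch, which also assumes the ST-set has at least two taxa.

```python
from collections import Counter

def tree_edges(taxa, clusters, root_name):
    """Edges of the rooted tree on `taxa` representing pairwise-compatible clusters.

    Leaves are named by their taxon, the root by `root_name`, and every other
    internal node by the frozenset of taxa below it.
    """
    taxa = frozenset(taxa)
    nodes = {taxa} | {frozenset([x]) for x in taxa} | {frozenset(c) for c in clusters}

    def label(node):
        if node == taxa:
            return root_name
        return next(iter(node)) if len(node) == 1 else node

    edges = []
    for node in nodes:
        supersets = [m for m in nodes if node < m]
        if supersets:
            # In a compatible family the strict supersets form a chain,
            # so the smallest one is the parent.
            edges.append((label(min(supersets, key=len)), label(node)))
    return edges

def expand(network_edges, collapsed_taxon, st_set, restricted_clusters):
    """Turn the leaf `collapsed_taxon` into the root of the tree for C|S_j."""
    return list(network_edges) + tree_edges(st_set, restricted_clusters, collapsed_taxon)

def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with indegree at least 2."""
    indegree = Counter(v for _, v in edges)
    return sum(d - 1 for d in indegree.values() if d > 1)

# Hypothetical network on the collapsed taxa "ab", "c", "d" with one reticulation h.
collapsed = [("root", "p"), ("root", "q"), ("p", "ab"), ("p", "h"),
             ("q", "h"), ("q", "d"), ("h", "c")]
expanded = expand(collapsed, "ab", {"a", "b"}, [{"a", "b"}])
assert reticulation_number(expanded) == reticulation_number(collapsed)
```

Only tree edges are added, so no node gains a second parent and the reticulation number is unchanged, as the lemma states.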

Corollary 4 Let C be a cluster set on X, and let C' be the ST-collapsed version
of C. Then $r(C') = r(C)$.

Proof. Lemma 7 tells us that $r(C) \leq r(C')$. To see that $r(C') \leq r(C)$, observe
that Corollary 3 allows us to assume the existence of a network N with
reticulation number r(C) such that all the maximal ST-sets of C are below cut-edges
in N. If, for each maximal ST-set $S_j$ of C, we replace the subtree corresponding
to $S_j$ with a single taxon $S_j$, we obtain a network with reticulation number at
most r(C) which represents C'. □

Since the transformation described in the proof of Lemma 7 can be executed
in time $f(r) \cdot \mathrm{poly}(n)$, Lemma 7 and Corollary 4 allow us to
henceforth restrict our attention to ST-collapsed cluster sets. Networks
that represent ST-collapsed cluster sets have a rather restricted topology, as
the following lemma shows.

Lemma 8 Let C be an ST-collapsed cluster set on X, and let N be a binary
network that represents C. Then, for each cut-edge (u, v) of N,
either v is a leaf labelled by a taxon from X, or there is a directed path starting
from v that reaches a reticulation node.

Proof. Let $X(v) \subseteq X$ be the set of taxa reachable from v by directed paths.
If $|X(v)| \geq 2$, but there are no reticulation nodes reachable by directed paths
from v, then the subnetwork rooted at v is actually a tree with taxon set X(v),
meaning that X(v) is an ST-set of cardinality 2 or higher. This violates the
ST-collapsed assumption, giving a contradiction. If $|X(v)| = 1$ then it follows
that either v is a leaf labelled by a taxon, or (due to the fact that N is binary
and contains no nodes with indegree and outdegree both equal to 1) at least
one reticulation node is reachable from v by a directed path. □

We are now (finally) ready to define an r-reticulation generator. This is very
closely related to the level-k generator discussed in Section 3. The only
significant differences are that r-reticulation generators do not have to be
biconnected and that (for technical reasons) they include a "fake root".

Definition 3 An r-reticulation generator is a directed acyclic multigraph, 
which has a single node of indegree 0, called the fake root, and this has out- 
degree 1; precisely r reticulation nodes (indegree 2 and outdegree at most 1), 
and apart from that only nodes of indegree 1 and outdegree 2. 

Note that this definition implies that an r-reticulation generator cannot
contain any leaf. As in the case of level-k generators, nodes with indegree 2
and outdegree 0 as well as all edges are called sides. Figure 6 shows the single
1-reticulation generator and the seven 2-reticulation generators.
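The conditions of Definition 3 are purely local degree conditions plus acyclicity, so they can be tested mechanically. The sketch below is one possible check on a multigraph stored as a list of (tail, head) pairs, with parallel edges listed twice; the function names and node labels are assumptions made for this sketch.

```python
from collections import Counter

def is_generator(edges, r):
    """Check the conditions of Definition 3 for the given value of r."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    nodes = set(indeg) | set(outdeg)

    # Acyclicity via Kahn's algorithm; parallel edges are counted separately.
    remaining = {x: indeg[x] for x in nodes}
    queue = [x for x in nodes if remaining[x] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for tail, head in edges:
            if tail == u:
                remaining[head] -= 1
                if remaining[head] == 0:
                    queue.append(head)
    if seen != len(nodes):
        return False

    roots = [x for x in nodes if indeg[x] == 0]
    retics = [x for x in nodes if indeg[x] == 2 and outdeg[x] <= 1]
    others = nodes - set(roots) - set(retics)
    return (len(roots) == 1 and outdeg[roots[0]] == 1
            and len(retics) == r
            and all(indeg[x] == 1 and outdeg[x] == 2 for x in others))

def sides(edges):
    """Sides: nodes with indegree 2 and outdegree 0, together with all edges."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    return [x for x in indeg if indeg[x] == 2 and outdeg[x] == 0] + list(edges)

# The single 1-reticulation generator: fake root, one split node s and a
# reticulation t that receives a double edge from s.
g1 = [("fake_root", "s"), ("s", "t"), ("s", "t")]
assert is_generator(g1, 1)
assert len(sides(g1)) == 4  # three edges plus the reticulation t
```

A graph containing an ordinary leaf (indegree 1, outdegree 0) fails the final degree test, in line with the remark that a generator cannot contain any leaf.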

Lemma 9 There are at most $f(r)$ r-reticulation generators and each
r-reticulation generator contains at most $f(r)$ sides.

Proof. In Lemma 1 of [15] it is proven that a level-k generator has at most
$3k - 1$ vertices and at most $4k - 2$ edges. The proof there does not exploit the
biconnectedness of level-k generators, so - with the exception of the fake root
- also holds for r-reticulation generators. By adding 1 to both the vertex and
edge upper bounds to account for the fake root we come to upper bounds of
$3r$ and $4r - 1$ respectively. Hence we obtain $7r - 1$ as a crude upper bound
on the number of sides in a generator. To see that there are at most $f(r)$
r-reticulation generators observe that between any pair of nodes u and v in
the generator there is either no edge, an edge from u to v (or from v to u),
or a multi-edge from u to v (or from v to u). With five possibilities for each of
the at most $\binom{3r}{2}$ pairs of nodes, the number of r-reticulation generators
is bounded by a function of r. □
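For reference, the arithmetic behind the side bound in this proof can be written out as follows; the labels on the left are informal shorthand introduced here, not notation from the article.

```latex
\begin{align*}
\#\text{vertices} &\le (3r - 1) + 1 = 3r, \\
\#\text{edges}    &\le (4r - 2) + 1 = 4r - 1, \\
\#\text{sides}    &\le 3r + (4r - 1) = 7r - 1.
\end{align*}
```

For $r = 1$ the vertex and edge bounds are both 3. Definition 3 forces the single 1-reticulation generator to consist of the fake root, one node of outdegree 2 and one reticulation joined by a double edge, so it attains both bounds.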

Definition 4 The set $\mathcal{N}^r$ (for $r \geq 1$) is defined as the set of all binary
networks that can be constructed by choosing some r-reticulation generator G,
then applying the leaf hanging transformation described in Definition 1 and
finally deleting the fake root (i.e. the single vertex with indegree 0 and outdegree
1) and its incident edge.

Lemma 4 (extended). Let C be an ST-collapsed set of clusters on X, such
that $r(C) \geq 1$. Then there exists a network N in $\mathcal{N}^{r(C)}$ such that N represents
C.

Proof. Let N be any binary network with reticulation number r(C) such that
N represents C. We show how applying the reverse of the transformation
described in Definition 1 to N will give some r(C)-reticulation generator G. The
lemma will then follow. We begin by adding a fake root to N, i.e. a new vertex
u' and an edge from u' to the root of N. (This is the inverse of deleting the
fake root.) We then delete all the leaves in N. Any nodes that are created with
indegree 2 and outdegree 0 we leave as they are (this is the inverse of step 3
of Definition 1). Nodes with indegree 1 and outdegree 0 cannot be created,
because this would require that there exists a node v in N which has indegree
1 and outdegree 2 such that both its children are leaves labelled by taxa. But
this would mean that v is the head of a cut-edge e where e violates the
condition described in Lemma 8. Now, consider the nodes that have been created
with indegree and outdegree both equal to 1. Let u be any such node, and let
$U = \{u\}$. Whenever U contains a node u whose unique parent p(u) also has
indegree and outdegree both equal to 1, add p(u) to U. Whenever U contains
a node u whose unique child c(u) also has indegree and outdegree both equal
to 1, add c(u) to U. We continue expanding U this way until it cannot grow
anymore. Clearly U stops growing at the point that U contains two nodes $u_{\text{top}}$
and $u_{\text{bottom}}$ (where possibly $u_{\text{top}} = u_{\text{bottom}}$) such that the parent of $u_{\text{top}}$
(respectively, child of $u_{\text{bottom}}$) does not have indegree and outdegree both equal
to 1. We suppress all the nodes in U, in the usual sense. Note, crucially, that
this does not affect the indegree or outdegree of the parent of $u_{\text{top}}$ or the child
of $u_{\text{bottom}}$. While N still contains nodes of indegree 1 and outdegree 1 we
repeat the above process, until none are left; this is the inverse of steps 1 and 2
of Definition 1. (Note that this process might create multi-edges, but because
it leaves the indegree and outdegree of unsuppressed nodes intact, there will
be at most two edges between any two nodes.) Now, let G be the resulting
structure. Observe that the reticulation number of G is the same as that of N, that
every node in G with indegree 2 has outdegree 0 or 1, that G contains a
single fake root, that all nodes in G with indegree 1 have outdegree 2, and that
G contains no leaves. We conclude that G is an r(C)-reticulation generator and
that we could have constructed N by applying the transformation described
in Definition 1 to G. □
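The reverse transformation used in this proof (add a fake root, delete the leaves, suppress nodes of indegree and outdegree 1) can be sketched in a few lines. This version suppresses one node at a time rather than a whole chain U at once, which gives the same result. The function name, the label "fake_root" and the example network are assumptions made for this sketch, and the input is assumed to satisfy Lemma 8.

```python
from collections import Counter

def to_generator(edges, leaves, root):
    """Reduce a binary network (list of (tail, head) pairs) to its generator."""
    # Add the fake root and delete every leaf together with its incoming edge.
    g = [("fake_root", root)] + [(u, v) for u, v in edges if v not in leaves]
    while True:
        indeg = Counter(v for _, v in g)
        outdeg = Counter(u for u, _ in g)
        target = next((x for x in indeg
                       if indeg[x] == 1 and outdeg[x] == 1), None)
        if target is None:
            return sorted(g)
        # Suppress the node: join its unique parent directly to its unique child.
        parent = next(u for u, v in g if v == target)
        child = next(v for u, v in g if u == target)
        g = [(u, v) for u, v in g if target not in (u, v)] + [(parent, child)]

# Hypothetical binary network with one reticulation h on taxa a, b, c.
network = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "h"),
           ("q", "h"), ("q", "c"), ("h", "b")]
generator = to_generator(network, {"a", "b", "c"}, "root")
assert generator == [("fake_root", "root"), ("root", "h"), ("root", "h")]
```

In the example, suppressing p and q produces a double edge from the root to the reticulation h, so the network reduces to the single 1-reticulation generator.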



Proposition 2 (extended). Let C be an ST-collapsed cluster set on X and let
$(x_1, \ldots, x_k)$ be an ordered set of distinct taxa of X such that $k \geq 2$ and
$x_i \to_C x_{i+1}$ for $1 \leq i \leq k - 1$. Then $x_k \not\to_C x_1$.

Proof. If $x_i \to_C x_{i+1}$ for $1 \leq i \leq k - 1$ and $x_k \to_C x_1$, this means that the set
$X' = \bigcup_{i=1}^{k} \{x_i\}$ is compatible with C. Moreover, every non-singleton cluster that
contains one element of X' contains them all. So every non-singleton cluster
in C is either disjoint from X', or contains it, from which we conclude that
any two clusters $C_1, C_2 \in C|X'$ are compatible. So X' is an ST-set and we have
a contradiction. □

Since the number of r-reticulation generators and the number of sides in
a generator are both bounded by $f(r)$ (Lemma 9), and we have extended
several critical utility results to also apply to ST-collapsed cluster sets and
r-reticulation generators, we have the following:

Fig. 6 The single 1-reticulation generator and the seven 2-reticulation generators.

Lemma 7 (extended). Let C be an ST-collapsed set of clusters on X. Then,
for every fixed $r \geq 0$, Algorithm determines whether a network that
represents C with reticulation number equal to r exists, and if so, constructs such a
network in time $f(r) \cdot \mathrm{poly}(n)$.

From Lemma 7 and Corollary 4 we may finally conclude the following.

Theorem 3 Let C be a set of clusters on X. Then, for every fixed $r \geq 0$, it is
possible to determine in time $f(r) \cdot \mathrm{poly}(n)$ whether a network that represents
C with reticulation number at most r exists.

5 Conclusions and open problems 

In this article we have shown that, under the softwired cluster model of
phylogenetic networks, constructing networks with minimum reticulation number
(respectively, level) is fixed parameter tractable where the reticulation
number (respectively, level) is the parameter. The obvious problem with the
algorithms in this article is that the part of the running time that depends only on
the parameter is massively exponential. This contrasts with fixed parameter
tractable algorithms for combining two trees into a phylogenetic network. In
this literature the dependence on the parameter is more modest: for example
in [1] an algorithm for two binary trees with running time of $O((14r)^r \cdot n^3)$
is given (where $n = |X|$ and r is the hybridization number of the two trees).
However, the two tree case is rather special [28, 19] and it is likely that, as the
number of trees increases, the dependence on the parameter will also increase 
dramatically. Relatedly, there is still no fixed parameter tractable algorithm 
for combining an arbitrary set of trees into a phylogenetic network using a 
minimum number of reticulations. Could the ideas presented in this article - 
in particular, the use of generators - offer a theoretical route to this result? 
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